The subject of probability and random processes is an important one for a variety of
disciplines.  Yet,  in  the  author’s  experience,  a  first  exposure  to  this  subject  can  cause 
difficulty  in  assimilating  the  material  and  even  more  so  in  applying  it  to  practical 
problems  of  interest.  The  goal  of  this  textbook  is  to  lessen  this  difficulty.  To  do 
so  we  have  chosen  to  present  the  material  with  an  emphasis  on  conceptualization. 
As  defined  by  Webster,  a  concept  is  “an  abstract  or  generic  idea  generalized  from 
particular  instances.”  This  embodies  the  notion  that  the  “idea”  is  something  we 
have formulated based on our past experience. This is in contrast to a theorem,
which according to Webster is “an idea accepted or proposed as a demonstrable
truth.” A theorem then is the result of many other persons’ past experiences, which
may or may not coincide with our own. In presenting the material we prefer to
first present “particular instances” or examples and then generalize using a
definition/theorem. Many textbooks use the opposite sequence, which undeniably is
cleaner and more compact, but omits the motivating examples that initially led
to the definition/theorem. Furthermore, in using the definition/theorem-first
approach, for the sake of mathematical correctness multiple concepts must be presented
at once. This is in opposition to human learning for which “under most conditions,
the greater the number of attributes to be bounded into a single concept, the more
difficult the learning becomes”¹. The philosophical approach of specific examples
followed  by  generalizations  is  embodied  in  this  textbook.  It  is  hoped  that  it  will 
provide  an  alternative  to  the  more  traditional  approach  for  exploring  the  subject  of 
probability  and  random  processes. 

To provide motivating examples we have chosen to use MATLAB², which is a
very  versatile  scientific  programming  language.  Our  own  engineering  students  at  the 
University  of  Rhode  Island  are  exposed  to  MATLAB  as  freshmen  and  continue  to  use 
it  throughout  their  curriculum.  Graduate  students  who  have  not  been  previously 
introduced  to  MATLAB  easily  master  its  use.  The  pedagogical  utility  of  using 
MATLAB  is  that: 

1. Specific computer-generated examples can be constructed to provide motivation
   for the more general concepts to follow.

2. Inclusion of computer code within the text allows the reader to interpret the
   mathematical equations more easily by seeing them in an alternative form.

3. Homework problems based on computer simulations can be assigned to illustrate
   and reinforce important concepts.

4. Computer experimentation by the reader is easily accomplished.

5. Typical results of probabilistic-based algorithms can be illustrated.

6. Real-world problems can be described and “solved” by implementing the solution
   in code.

¹ Eli Saltz, The Cognitive Basis of Human Learning, Dorsey Press, Homewood, IL, 1971.
² Registered trademark of The MathWorks, Inc.

Many  MATLAB  programs  and  code  segments  have  been  included  in  the  book.  In 
fact,  most  of  the  figures  were  generated  using  MATLAB.  The  programs  and  code 
segments listed within the book are available in the file probbook_matlab_code.tex,
which  can  be  found  at  http://www.ele.uri.edu/faculty/kay/New%20web/Books.htm. 
The  use  of  MATLAB,  along  with  a  brief  description  of  its  syntax,  is  introduced  early 
in  the  book  in  Chapter  2.  It  is  then  immediately  applied  to  simulate  outcomes  of 
random variables and to estimate various quantities such as means, variances, probability
mass functions, etc., even though these concepts have not as yet been formally
introduced.  This  chapter  sequencing  is  purposeful  and  is  meant  to  expose  the  reader 
to  some  of  the  main  concepts  that  will  follow  in  more  detail  later.  In  addition, 
the  reader  will  then  immediately  be  able  to  simulate  random  phenomena  to  learn 
through  doing,  in  accordance  with  our  philosophy.  In  summary,  we  believe  that 
the  incorporation  of  MATLAB  into  the  study  of  probability  and  random  processes 
provides  a  “hands-on”  approach  to  the  subject  and  promotes  better  understanding. 

Other  pedagogical  features  of  this  textbook  are  the  discussion  of  discrete  random 
variables first to allow easier assimilation of the concepts followed by a parallel discussion
for continuous random variables. Although this entails some redundancy, we
have found less confusion on the part of the student using this approach. In a similar
vein, we first discuss scalar random variables, then bivariate (or two-dimensional)
random variables, and finally N-dimensional random variables, reserving separate
chapters for each. All chapters, except for the introductory chapter, begin with a
summary of the important concepts and point to the main formulas of the chapter,
and end with a real-world example. The latter illustrates the utility of the
material just studied and provides a powerful motivation for further study. It also
will, hopefully, answer the ubiquitous question “Why do we have to study this?”
We  have  tried  to  include  real-world  examples  from  many  disciplines  to  indicate  the 
wide  applicability  of  the  material  studied.  There  are  numerous  problems  in  each 
chapter  to  enhance  understanding  with  some  answers  listed  in  Appendix  E.  The 
problems are of four types: “formula” problems, which are simple applications
of the important formulas of the chapter; “word” problems, which require
a problem-solving capability; “theoretical” problems, which are more abstract
and mathematically demanding; and “computer” problems, which
are either computer simulations or involve the application of computers to facilitate
analytical solutions. A complete solutions manual for all the problems is available
to  instructors  from  the  author  upon  request.  Finally,  we  have  provided  warnings  on 
how  to  avoid  common  errors  as  well  as  in-line  explanations  of  equations  within  the 
derivations  for  clarification. 

The book was written mainly to be used in a first-year graduate-level course
in probability and random processes. As such, we assume that the student has
had  some  exposure  to  basic  probability  and  therefore  Chapters  3-11  can  serve  as 
a  review  and  a  summary  of  the  notation.  We  then  will  cover  Chapters  12-15  on 
probability  and  selected  chapters  from  Chapters  16-22  on  random  processes.  This 
book  can  also  be  used  as  a  self-contained  introduction  to  probability  at  the  senior 
undergraduate  or  graduate  level.  It  is  then  suggested  that  Chapters  1-7,  10,  11  be 
covered.  Finally,  this  book  is  suitable  for  self-study  and  so  should  be  useful  to  the 
practitioner  as  well  as  the  student.  The  necessary  background  that  has  been  assumed 
is  a  knowledge  of  calculus  (a  review  is  included  in  Appendix  B);  some  linear/matrix 
algebra  (a  review  is  provided  in  Appendix  C);  and  linear  systems,  which  is  necessary 
only  for  Chapters  18-20  (although  Appendix  D  has  been  provided  to  summarize  and 
illustrate  the  important  concepts). 

The  author  would  like  to  acknowledge  the  contributions  of  the  many  people  who 
over the years have provided stimulating discussions of teaching and research problems
and opportunities to apply the results of that research. Thanks are due to my
colleagues L. Jackson, R. Kumaresan, L. Pakula, and P. Swaszek of the University
of Rhode Island. A debt of gratitude is owed to all my current and former graduate
students. They have contributed to the final manuscript through many hours of
pedagogical and research discussions as well as by their specific comments and questions.
In particular, Lin Huang and Cuichun Xu proofread the entire manuscript and
helped with the problem solutions, while Russ Costa provided feedback. Lin Huang
also aided with the intricacies of LaTeX while Lisa Kay and Jason Berry helped with
the artwork and to demystify the workings of Adobe Illustrator 10.³ The author
is indebted to the many agencies and program managers who have sponsored his
research, including the Naval Undersea Warfare Center, the Naval Air Warfare Center,
the Air Force Office of Scientific Research, and the Office of Naval Research.
As  always,  the  author  welcomes  comments  and  corrections,  which  can  be  sent  to 
kay@ele.uri.edu. 


Steven  M.  Kay 
University  of  Rhode  Island 
Kingston,  RI  02881 
Chapter 1


Introduction 


1.1  What  Is  Probability? 

Probability  as  defined  by  Webster’s  dictionary  is  “the  chance  that  a  given  event  will 
occur”.  Examples  that  we  are  familiar  with  are  the  probability  that  it  will  rain 
the  next  day  or  the  probability  that  you  will  win  the  lottery.  In  the  first  example, 
there  are  many  factors  that  affect  the  weather — so  many,  in  fact,  that  we  cannot  be 
certain  that  it  will  or  will  not  rain  the  following  day.  Hence,  as  a  predictive  tool  we 
usually  assign  a  number  between  0  and  1  (or  between  0%  and  100%)  indicating  our 
degree  of  certainty  that  the  event,  rain,  will  occur.  If  we  say  that  there  is  a  30% 
chance  of  rain,  we  believe  that  if  identical  conditions  prevail,  then  3  times  out  of  10, 
rain  will  occur  the  next  day.  Alternatively,  we  believe  that  the  relative  frequency  of 
rain  is  3/10.  Note  that  if  the  science  of  meteorology  had  accurate  enough  models, 
then  it  is  conceivable  that  we  could  determine  exactly  whether  rain  would  or  would 
not  occur.  Or  we  could  say  that  the  probability  is  either  0  or  1.  Unfortunately,  we 
have  not  progressed  that  far.  In  the  second  example,  winning  the  lottery,  our  chance 
of  success,  assuming  a  fair  drawing,  is  just  one  out  of  the  number  of  possible  lottery 
number  sequences.  In  this  case,  we  are  uncertain  of  the  outcome,  not  because  of  the 
inaccuracy  of  our  model,  but  because  the  experiment  has  been  designed  to  produce 
uncertain  results. 

The common thread of these two examples is the presence of a random experiment,
a set of outcomes, and the probabilities assigned to these outcomes. We will
see later that these attributes are common to all probabilistic descriptions. In the
lottery example, the experiment is the drawing, the outcomes are the lottery number
sequences, and the probabilities assigned are 1/N, where N = total number of
lottery number sequences. Another common thread, which justifies the use of
probabilistic methods, is the concept of statistical regularity. Although we may never
be able to predict with certainty the outcome of an experiment, we are, nonetheless,
able to predict “averages.” For example, the average rainfall in the summer in
Rhode Island is 9.76 inches, as shown in Figure 1.1, while in Arizona it is only 4.40
inches, as shown in Figure 1.2. It is clear that the decision to plant certain types
of crops could be made based on these averages. This is not to say, however, that
we can predict the rainfall amounts for any given summer. For instance, in 1999
the summer rainfall in Rhode Island was only 4.5 inches while in 1984 the summer
rainfall in Arizona was 7.3 inches. A somewhat more controlled experiment is the
repeated tossing of a fair coin (one that is equally likely to come up heads or tails).
We would expect about 50 heads out of 100 tosses, but of course, we could not
predict the outcome of any one particular toss. An illustration of this is shown in
Figure 1.3. Note that 53 heads were obtained in this particular experiment. This
example, which is of seemingly little relevance to physical reality, actually serves as
a good model for a variety of random phenomena. We will explore one example in
the next section.

Figure 1.1: Annual summer rainfall in Rhode Island from 1895 to 2002
[NOAA/NCDC 2003]. (Horizontal axis: Year.)

Figure 1.2: Annual summer rainfall in Arizona from 1895 to 2002 [NOAA/NCDC
2003]. (Horizontal axis: Year.)

Figure 1.3: Outcomes for repeated fair coin tossings. (Horizontal axis: Toss.)
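As a minimal sketch of the coin tossing experiment just described, the following MATLAB fragment simulates 100 tosses of a fair coin and counts the heads. The variable names are our own choices for illustration, and because the outcomes are random the count will generally differ from run to run and from the 53 heads noted above.

```matlab
% Sketch: simulate 100 tosses of a fair coin and count the heads.
% All variable names here are illustrative, not taken from the text.
numTosses = 100;
tosses = rand(numTosses,1) < 0.5;   % 1 = head, 0 = tail, each with probability 1/2
numHeads = sum(tosses);             % total number of heads observed
relFreq = numHeads/numTosses;       % relative frequency of heads
fprintf('%d heads in %d tosses (relative frequency %.2f)\n', ...
        numHeads, numTosses, relFreq);
```

Rerunning the fragment several times gives a feel for statistical regularity: the relative frequency tends to stay near 1/2 even though no single toss can be predicted.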

In summary, probability theory provides us with the ability to predict the behavior
of random phenomena in the “long run.” To the extent that this information
is useful, probability can serve as a valuable tool for assessment and decision making.
Its application is widespread, finding use in all fields of scientific endeavor
such as engineering, medicine, economics, physics, and others (see references at end
of chapter).

1.2  Types  of  Probability  Problems 

Because of the mathematics required to determine probabilities, probabilistic methods
are divided into two distinct types, discrete and continuous. A discrete approach
is used when the number of experimental outcomes is finite (or infinite but countable,
as illustrated in Problem 1.7). For example, consider the number of persons
at a business location that are talking on their respective phones anytime between
9:00 AM and 9:10 AM. Clearly, the possible outcomes are 0, 1, . . . , N, where N is
the number of persons in the office. On the other hand, if we are interested in the
length of time a particular caller is on the phone during that time period, then the
outcomes may be anywhere from 0 to T minutes, where T = 10. Now the outcomes
are infinite in number since they lie within the interval [0, T]. In the first case, since
the outcomes are discrete (and finite), we can assign probabilities to the outcomes
{0, 1, . . . , N}. An equiprobable assignment would be to assign each outcome a probability
of 1/(N + 1). In the second case, the outcomes are continuous (and therefore
infinite) and so it is not possible to assign a nonzero probability to each outcome
(see Problem 1.6).
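The equiprobable assignment for the discrete case can be written out in a few lines of MATLAB. This is only an illustrative sketch: the value N = 4 and the variable names are our assumptions, chosen to match the office example used later in the chapter.

```matlab
% Sketch: equiprobable assignment for the discrete outcomes {0,1,...,N}.
N = 4;                        % number of persons in the office (illustrative value)
outcomes = (0:N)';            % the N+1 possible outcomes
probs = ones(N+1,1)/(N+1);    % each outcome is assigned probability 1/(N+1)
total = sum(probs);           % the assigned probabilities should sum to one
disp([outcomes probs]);
fprintf('sum of probabilities = %g\n', total);
```

Each of the five outcomes receives probability 0.2, and the assignment is valid because the probabilities sum to one.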

We  will  henceforth  delineate  between  probabilities  assigned  to  discrete  outcomes 
and  those  assigned  to  continuous  outcomes,  with  the  discrete  case  always  discussed 
first.  The  discrete  case  is  easier  to  conceptualize  and  to  describe  mathematically.  It 
will  be  important  to  keep  in  mind  which  case  is  under  consideration  since  otherwise, 
certain  paradoxes  may  result  (as  well  as  much  confusion  on  the  part  of  the  student!). 


1.3  Probabilistic  Modeling 

Probability  models  are  simplified  approximations  to  reality.  They  should  be  detailed 
enough  to  capture  important  characteristics  of  the  random  phenomenon  so  as  to  be 
useful  as  a  prediction  device,  but  not  so  detailed  so  as  to  produce  an  unwieldy 
model  that  is  difficult  to  use  in  practice.  The  example  of  the  number  of  telephone 
callers  can  be  modeled  by  assigning  a  probability  p  to  each  person  being  on  the 
phone  anytime  in  the  given  10-minute  interval  and  assuming  that  whether  one  or 
more  persons  are  on  the  phone  does  not  affect  the  probability  of  others  being  on 
the  phone.  One  can  thus  liken  the  event  of  being  on  the  phone  to  a  coin  toss — 
if  heads,  a  person  is  on  the  phone  and  if  tails,  a  person  is  not  on  the  phone.  If 
there  are  N  =  4  persons  in  the  office,  then  the  experimental  outcome  is  likened  to 
4  coin  tosses  (either  in  succession  or  simultaneously — it  makes  no  difference  in  the 
modeling).  We  can  then  ask  for  the  probability  that  3  persons  are  on  the  phone 
by  determining  the  probability  of  3  heads  out  of  4  coin  tosses.  The  solution  to  this 
problem  will  be  discussed  in  Chapter  3,  where  it  is  shown  that  the  probability  of  k 
heads  out  of  N  coin  tosses  is  given  by 

$$P[k] = \binom{N}{k} p^k (1-p)^{N-k} \tag{1.1}$$

where

$$\binom{N}{k} = \frac{N!}{(N-k)!\,k!}$$

for k = 0, 1, . . . , N, and where M! = 1 · 2 · 3 · · · M for M a positive integer and by
definition 0! = 1. For our example, if p = 0.75 (we have a group of telemarketers)
and N = 4, a compilation of the probabilities is shown in Figure 1.4. It is seen that
the probability that three persons are on the phone is 0.42.

Figure 1.4: Probabilities for N = 4 coin tossings with p = 0.75.

Generally, the coin toss model is a reasonable one for this type of situation. It will
be poor, however, if the assumptions are invalid. Some practical objections to the
model might be:

1. Different persons have different probabilities p (an eager telemarketer versus a
   not so eager one).

2. The probability of one person being on the phone is affected by whether his
   neighbor is on the phone (the two neighbors tend to talk about their planned
   weekends), i.e., the events are not “independent.”

3. The probability p changes over time (later in the day there is less phone activity
   due to fatigue).

To  accommodate  these  objections  the  model  can  be  made  more  complex.  In  the 
end,  however,  the  “more  accurate”  model  may  become  a  poorer  predictor  if  the 
additional  information  used  is  not  correct.  It  is  generally  accepted  that  a  model 
should  exhibit  the  property  of  “parsimony” — in  other  words,  it  should  be  as  simple 
as  possible. 
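Returning to the simple coin toss model, the probabilities given by (1.1) are easily evaluated on a computer. The following is a minimal sketch for N = 4 and p = 0.75; it uses the built-in MATLAB function nchoosek for the binomial coefficient, and the variable names are our own.

```matlab
% Sketch: evaluate (1.1) for N = 4 persons and p = 0.75.
N = 4;
p = 0.75;
k = (0:N)';                   % possible numbers of heads
P = zeros(N+1,1);
for i = 1:N+1
    % binomial coefficient times p^k times (1-p)^(N-k)
    P(i) = nchoosek(N,k(i)) * p^k(i) * (1-p)^(N-k(i));
end
disp([k P]);                  % first column: k, second column: P[k]
```

The entry for k = 3 is 0.421875, which rounds to the value 0.42 quoted in the text, and the five probabilities sum to one.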

The previous example had discrete outcomes. For continuous outcomes a frequently
used probabilistic model is the Gaussian or “bell”-shaped curve. For the
modeling of the length of time T a caller is on the phone it is not appropriate to
ask for the probability that T will be exactly, for example, 5 minutes. This is because
this probability will be zero (see Problem 1.6). Instead, we inquire as to the
probability that T will be between 5 and 6 minutes. This question is answered by
determining the area under the Gaussian curve shown in Figure 1.5.

Figure 1.5: Gaussian or “bell”-shaped curve.

The form of the curve is given by

$$p_T(t) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(t-7)^2\right], \qquad -\infty < t < \infty \tag{1.2}$$

and although defined for all t, it is physically meaningful only for 0 < t < T_max,
where T_max = 10 for the current example. Since the area under the curve for times
less than zero or greater than T_max = 10 is nearly zero, this model is a reasonable
approximation to physical reality. The curve has been chosen to be centered about
t = 7 to reflect an “average” time on the phone of 7 minutes for a given caller. Also,
note that we let t denote the actual value of the random time T. Now, to determine
the probability that the caller will be on the phone for between 5 and 6 minutes we
integrate p_T(t) over this interval to yield

$$P[5 < T < 6] = \int_5^6 p_T(t)\,dt = 0.1359. \tag{1.3}$$

The value of the integral must be numerically determined. Knowing the function
p_T(t) allows us to determine the probability for any interval. (It is called the probability
density function (PDF) and is the probability per unit length. The PDF will
be discussed in Chapter 10.) Also, it is apparent from Figure 1.5 that phone usage
of duration less than 4 minutes or greater than 10 minutes is highly unlikely. Phone
usage in the range of 7 minutes, on the other hand, is most probable. As before,
some objections might be raised as to the accuracy of this model. A particularly
lazy worker could be on the phone for only 3 minutes, as an example.
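One way the numerical determination of (1.3) might be carried out is sketched below. We assume the Gaussian curve of (1.2), centered at t = 7 with unit spread, and use the built-in functions trapz (trapezoidal integration) and erf (error function); the grid spacing and variable names are our own choices.

```matlab
% Sketch: approximate the area under the Gaussian curve between t = 5 and t = 6.
pT = @(t) exp(-0.5*(t-7).^2)/sqrt(2*pi);   % Gaussian curve centered at t = 7
t = linspace(5,6,1001);                    % grid with spacing 0.001 (our choice)
probTrapz = trapz(t,pT(t));                % trapezoidal approximation of the integral
% Cross-check with the error function, valid for this unit-variance Gaussian.
probErf = 0.5*(erf((6-7)/sqrt(2)) - erf((5-7)/sqrt(2)));
fprintf('trapz: %.4f, erf: %.4f\n', probTrapz, probErf);
```

Both values agree with the 0.1359 of (1.3) to four decimal places. Changing the endpoints of the grid gives the probability for any other interval.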
In this book we will henceforth assume that the models, which are mathematical
in  nature,  are  perfect  and  thus  can  be  used  to  determine  probabilities.  In  practice, 
the  user  must  ultimately  choose  a  model  that  is  a  reasonable  one  for  the  application 
of  interest. 

1.4  Analysis  versus  Computer  Simulation 

In  the  previous  section  we  saw  how  to  compute  probabilities  once  we  were  given 
certain  probability  functions  such  as  (1.1)  for  the  discrete  case  and  (1.2)  for  the 
continuous  case.  For  many  practical  problems  it  is  not  possible  to  determine  these 
functions.  However,  if  we  have  a  model  for  the  random  phenomenon,  then  we 
may  carry  out  the  experiment  a  large  number  of  times  to  obtain  an  approximate 
probability.  For  example,  to  determine  the  probability  of  3  heads  in  4  tosses  of  a 
coin  with  probability  of  heads  being  p  =  0.75,  we  toss  the  coin  four  times  and  count 
the number of heads, say x_1 = 2. Then, we repeat the experiment by tossing the
coin four more times, yielding x_2 = 1 head. Continuing in this manner we execute
a succession of 1000 experiments to produce the sequence of number of heads as
{x_1, x_2, . . . , x_1000}. Then, to determine the probability of 3 heads we use a relative
frequency interpretation of probability to yield

$$P[3\ \text{heads}] \approx \frac{\text{Number of times 3 heads observed}}{1000}.$$

Indeed,  early  on  probabilists  did  exactly  this,  although  it  was  extremely  tedious.  It 
is  therefore  of  utmost  importance  to  be  able  to  simulate  this  procedure.  With  the 
advent  of  the  modern  digital  computer  this  is  now  possible.  A  digital  computer 
has  no  problem  performing  a  calculation  once,  100  times,  or  1,000,000  times.  What 
is  needed  to  implement  this  approach  is  a  means  to  simulate  the  toss  of  a  coin. 
Fortunately,  this  is  quite  easy  as  most  scientific  software  packages  have  built-in 
random  number  generators.  In  MATLAB,  for  example,  a  number  in  the  interval 
(0,1) can be produced with the simple statement x=rand(1,1). The number is
chosen  “at  random”  so  that  it  is  equally  likely  to  be  anywhere  in  the  (0, 1)  interval. 
As  a  result,  a  number  in  the  interval  (0, 1/2]  will  be  observed  with  probability  1/2 
and  a  number  in  the  remaining  part  of  the  interval  (1/2,1)  also  with  probability 
1/2.  Likewise,  a  number  in  the  interval  (0,0.75]  will  be  observed  with  probability 
p  =  0.75.  A  computer  simulation  of  the  number  of  persons  in  the  office  on  the 
telephone  can  thus  be  implemented  with  the  MATLAB  code  (see  Appendix  2A  for 
a  brief  introduction  to  MATLAB): 

number=0;
for i=1:4                  % set up simulation for 4 coin tosses
    if rand(1,1)<0.75      % toss coin with p=0.75
        x(i,1)=1;          % head
    else
        x(i,1)=0;          % tail
    end
    number=number+x(i,1);  % count number of heads
end

Repeating  this  code  segment  1000  times  will  result  in  a  simulation  of  the  previous 
experiment. 
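A minimal sketch of the repeated experiment is given next. It wraps the four-toss simulation in an outer loop of 1000 experiments and then forms the relative frequency of observing 3 heads. The outer loop and the variable names (numExperiments, numHeads, Pest) are our own additions for illustration.

```matlab
% Sketch: estimate the probability of 3 heads in 4 tosses by relative frequency.
numExperiments = 1000;
p = 0.75;                              % probability of a head
numHeads = zeros(numExperiments,1);    % number of heads in each experiment
for m = 1:numExperiments
    number = 0;
    for i = 1:4                        % 4 coin tosses per experiment
        if rand(1,1) < p               % toss coin with p = 0.75
            number = number + 1;       % count a head
        end
    end
    numHeads(m) = number;
end
Pest = sum(numHeads == 3)/numExperiments;   % relative frequency of 3 heads
fprintf('estimated probability of 3 heads = %.3f\n', Pest);
```

The estimate varies from run to run but should typically lie close to the value 0.42 obtained from (1.1).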

Similarly,  for  a  continuous  outcome  experiment  we  require  a  means  to  generate 
a  continuum  of  outcomes  on  a  digital  computer.  Of  course,  strictly  speaking  this  is 
not  possible  since  digital  computers  can  only  provide  a  finite  set  of  numbers,  which 
is  determined  by  the  number  of  bits  in  each  word.  But  if  the  number  of  bits  is 
large enough, then the approximation is adequate. For example, with 64 bits we
could represent 2^64 numbers between 0 and 1, so that neighboring numbers would
be 2^(-64) ≈ 5 × 10^(-20) apart. With this ability MATLAB can produce numbers that
follow a Gaussian curve by invoking the statement x=randn(1,1).
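As an illustrative sketch, randn can be used to simulate the phone time T of the previous section and to estimate the probability that T lies between 5 and 6 minutes. We assume here that adding 7 to the output of randn produces outcomes that follow the curve of (1.2), which is centered at t = 7; the number of trials and the variable names are our own choices.

```matlab
% Sketch: estimate P[5 < T < 6] by simulating Gaussian outcomes.
numTrials = 100000;                     % number of simulated callers (our choice)
T = 7 + randn(numTrials,1);             % Gaussian outcomes centered at 7 minutes
Pest = sum(T > 5 & T < 6)/numTrials;    % relative frequency of 5 < T < 6
fprintf('estimated P[5 < T < 6] = %.4f\n', Pest);
```

With this many trials the estimate should typically fall within a few thousandths of the value 0.1359 given in (1.3).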

Throughout  the  text  we  will  use  MATLAB  for  examples  and  also  exercises. 
However,  any  modern  scientific  software  package  can  be  used. 

1.5  Some  Notes  to  the  Reader 

The  notation  used  in  this  text  is  summarized  in  Appendix  A.  Note  that  boldface 
type  is  reserved  for  vectors  and  matrices  while  regular  face  type  will  denote  scalar 
quantities.  All  other  symbolism  is  defined  within  the  context  of  the  discussion.  Also, 
the reader will frequently be warned of potential “pitfalls.” Common misconceptions
leading to student errors will be described and noted. The pitfall or caution
symbol shown below should be heeded.

[caution symbol]

The  problems  are  of  four  types:  computational  or  formula  applications,  word 
problems,  computer  exercises,  and  theoretical  exercises.  Computational  or  formula 
(denoted  by  f)  problems  are  straightforward  applications  of  the  various  formulas  of 
the chapter, while word problems (denoted by w) require a more complete assimilation
of the material to solve the problem. Computer exercises (denoted by c) will
require the student to either use a computer to solve a problem or to simulate the
analytical results. This will enhance understanding and can be based on MATLAB,
although equivalent software may be used. Finally, theoretical exercises (denoted by
t) will serve to test the student’s analytical skills as well as to provide extensions to
the material of the chapter. They are more challenging. Answers to selected problems
are given in Appendix E. Those problems for which the answers are provided
are noted in the problem section with the symbol (o).

The version of MATLAB used in this book is 5.2, although newer versions
should provide identical results. Many MATLAB outputs that are used for the
text figures and for the problem solutions rely on random number generation. To
match your results against those shown in the figures and the problem solutions, the
same set of random numbers can be generated by using the MATLAB statements
rand('state',0) and randn('state',0) at the beginning of each program. These
statements will initialize the random number generators to produce the same set of
random numbers. Finally, the MATLAB programs and code segments given in the
book are indicated by the “typewriter” font, for example, x=randn(1,1).
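The effect of initializing the generator can be checked with a short sketch such as the following. It assumes that the rand('state',0) syntax quoted above is still accepted by the MATLAB release in use; newer releases document the rng function as the preferred way to seed the generators, so this older form may be treated as legacy there. The variable names are our own.

```matlab
% Sketch: resetting the generator state reproduces the same random numbers.
rand('state',0);          % initialize the uniform generator (syntax quoted in the text)
a = rand(3,1);            % first set of three numbers
rand('state',0);          % reset to the same state
b = rand(3,1);            % second set, generated from the same starting state
same = isequal(a,b);      % true if the two sets are identical
fprintf('sequences identical: %d\n', same);
```

The same check can be applied to randn('state',0) by replacing rand with randn throughout.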

There  are  a  number  of  other  textbooks  that  the  reader  may  wish  to  consult. 
They  are  listed  in  the  following  reference  list,  along  with  some  comments  on  their 
contents. 

Davenport, W.B., Probability and Random Processes, McGraw-Hill, New York,
1970. (Excellent introductory text.)

Feller, W., An Introduction to Probability Theory and its Applications, Vols. 1,
2, John Wiley, New York, 1950. (Definitive work on probability; requires
mature mathematical knowledge.)

Hoel, P.G., S.C. Port, C.J. Stone, Introduction to Probability Theory, Houghton
Mifflin Co., Boston, 1971. (Excellent introductory text but limited to probability.)

Leon-Garcia, A., Probability and Random Processes for Electrical Engineering,
Addison-Wesley, Reading, MA, 1994. (Excellent introductory text.)

Parzen, E., Modern Probability Theory and Its Applications, John Wiley, New York,
1960. (Classic text in probability; useful for all disciplines.)

Parzen, E., Stochastic Processes, Holden-Day, San Francisco, 1962. (Most useful
for Markov process descriptions.)

Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-

Problems

1.1 (o) (w) A fair coin is tossed. Identify the random experiment, the set of
    outcomes, and the probabilities of each possible outcome.

1.2 (w) A card is chosen at random from a deck of 52 cards. Identify the random
    experiment, the set of outcomes, and the probabilities of each possible
    outcome.

1.3 (w) A fair die is tossed and the number of dots on the face noted. Identify the
    random experiment, the set of outcomes, and the probabilities of each possible
    outcome.

1.4 (w) It is desired to predict the annual summer rainfall in Rhode Island for 2010.
    If we use 9.76 inches as our prediction, how much in error might we be, based
    on the past data shown in Figure 1.1? Repeat the problem for Arizona by
    using 4.40 inches as the prediction.
1.5 (o) (w) Determine whether the following experiments have discrete or continuous
    outcomes:

    a. Throw a dart with a point tip at a dartboard.

    b. Toss a die.

    c. Choose a lottery number.

    d. Observe the outdoor temperature using an analog thermometer.

    e. Determine the current time in hours, minutes, seconds, and AM or PM.

1.6 (w) An experiment has N = 10 outcomes that are equally probable. What is
    the probability of each outcome? Now let N = 1000 and also N = 1,000,000
    and repeat. What happens as N → ∞?

1.7 (o) (f) Consider an experiment with possible outcomes {1, 2, 3, . . .}. If we
    assign probabilities

    $$P[k] = \frac{1}{2^k}, \qquad k = 1, 2, 3, \ldots$$

    to the outcomes, will these probabilities sum to one? Can you have an infinite
    number of outcomes but still assign nonzero probabilities to each outcome?
    Reconcile these results with that of Problem 1.6.

1.8 (w) An experiment consists of tossing a fair coin four times in succession. What
    are the possible outcomes? Now count up the number of outcomes with three
    heads. If the outcomes are equally probable, what is the probability of three
    heads? Compare your results to that obtained using (1.1).

1.9 (w) Perform the following experiment by actually tossing a coin of your choice.
    Flip the coin four times and observe the number of heads. Then, repeat this
    experiment 10 times. Using (1.1) determine the probability for k = 0, 1, 2, 3, 4
    heads. Next use (1.1) to determine the number of heads that is most probable
    for a single experiment. In your 10 experiments which number of heads
    appeared most often?

1.10 (o) (w) A coin is tossed 12 times. The sequence observed is the 12-tuple
    (H, H, T, H, H, T, H, H, H, H, T, H). Is this a fair coin? Hint: Determine
    P[k = 9] using (1.1) assuming a probability of heads of p = 1/2.

1.11 (t) Prove that $\sum_{k=0}^{N} P[k] = 1$, where P[k] is given by (1.1). Hint: First prove
the binomial theorem

$$(a + b)^N = \sum_{k=0}^{N} \binom{N}{k} a^k b^{N-k}$$

by induction (see Appendix B). Use Pascal's "triangle" rule

$$\binom{M}{k} = \binom{M-1}{k} + \binom{M-1}{k-1}$$

where

$$\binom{M}{k} = 0 \qquad k < 0 \text{ and } k > M.$$

1.12 (t) If $\int_a^b p_T(t)\,dt$ is the probability of observing T in the interval [a, b], what is
$\int_{-\infty}^{\infty} p_T(t)\,dt$?

1.13 (^) (f) Using (1.2) what is the probability of T > 7? Hint: Observe that
$p_T(t)$ is symmetric about t = 7.

1.14  (0)(c)  Evaluate  the  integral 


by  using  the  approximation 


L 


n——L 


where L is the integer closest to $3/\Delta$ (the rounded value), for $\Delta = 0.1$, $\Delta = 0.01$,
$\Delta = 0.001$.

1.15  (c)  Simulate  a  fair  coin  tossing  experiment  by  modifying  the  code  given  in 
Section  1.4.  Using  1000  repetitions  of  the  experiment,  count  the  number  of 
times  three  heads  occur.  What  is  the  simulated  probability  of  obtaining  three 
heads  in  four  coin  tosses?  Compare  your  result  to  that  obtained  using  (1.1). 

1.16  (c)  Repeat  Problem  1.15  but  instead  consider  a  biased  coin  with  p  =  0.75. 
Compare  your  result  to  Figure  1.4. 


Chapter  2 


Computer  Simulation 

2.1  Introduction 


Computer simulation of random phenomena has become an indispensable tool in
modern scientific investigations. So-called Monte Carlo computer approaches are
now commonly used to promote understanding of probabilistic problems. In this
chapter we continue our discussion of computer simulation, first introduced in
Chapter 1, and set the stage for its use in later chapters. Along the way we will examine
some well-known properties of random events in the process of simulating their
behavior. A more formal mathematical description will be introduced later, but
careful attention to the details now will lead to a better intuitive understanding of
the mathematical definitions and theorems to follow.


2.2  Summary 


This chapter is an introduction to computer simulation of random experiments. In
Section 2.3 there are examples to show how we can use computer simulation to
provide counterexamples, build intuition, and lend evidence to a conjecture. However,
it cannot be used to prove theorems. In Section 2.4 a simple MATLAB program is
given to simulate the outcomes of a discrete random variable. Section 2.5 gives many
examples of typical computer simulations used in probability, including probability
density function estimation, probability of an interval, average value of a random
variable, probability density function for a transformed random variable, and scatter
diagrams for multiple random variables. Section 2.6 contains an application of
probability to the “real-world” example of a digital communication system. A brief
description of the MATLAB programming language is given in Appendix 2A.

2.3 Why Use Computer Simulation?

A  computer  simulation  is  valuable  in  many  respects.  It  can  be  used 

a.  to  provide  counterexamples  to  proposed  theorems 

b.  to  build  intuition  by  experimenting  with  random  numbers 

c.  to  lend  evidence  to  a  conjecture. 

We now explore these uses by posing the following question: What is the effect
of adding together the numerical outcomes of two or more experiments, i.e., what
are the probabilities of the summed outcomes? Specifically, if $U_1$ represents the
outcome of an experiment in which a number from 0 to 1 is chosen at random
and $U_2$ is the outcome of an experiment in which another number is also chosen at
random from 0 to 1, what are the probabilities of $X = U_1 + U_2$? The mathematical
answer to this question is given in Chapter 12 (see Example 12.8), although at
this point it is unknown to us. Let’s say that someone asserts that there is a
theorem that X is equally likely to be anywhere in the interval [0, 2]. To see if this is
reasonable, we carry out a computer simulation by generating values of $U_1$ and $U_2$
and adding them together. Then we repeat this procedure M times. Next we plot a
histogram, which gives the number of outcomes that fall in each subinterval within
[0, 2]. As an example of a histogram consider the M = 8 possible outcomes for
X of {1.7, 0.7, 1.2, 1.3, 1.8, 1.4, 0.6, 0.4}. Choosing the four subintervals (also called
bins) [0, 0.5], (0.5, 1], (1, 1.5], (1.5, 2], the histogram appears in Figure 2.1. In this
example, 2 outcomes were between 0.5 and 1 and are therefore shown by the bar
centered at 0.75. The other bars are similarly obtained.

Figure 2.1: Example of a histogram for a set of 8 numbers in [0,2] interval.

If we now increase the number of experiments to M = 1000, we obtain the histogram
shown in Figure 2.2.

Figure 2.2: Histogram for sum of two equally likely numbers, both chosen in interval
[0,1].

Now it is clear that the values of X are not equally likely. Values near one appear
to be much more probable. Hence, we have generated a “counterexample” to the
proposed theorem, or at least some evidence to the contrary.
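
The experiment just described can be sketched in a few lines of MATLAB. This is only an illustration, not the code behind Figure 2.2: it assumes the four bins of Figure 2.1 rather than the finer bins of the figure, and the variable names edges and h are chosen here for convenience.

```matlab
rand('state',0)                 % seed the generator (legacy syntax, as in this chapter's listings)
M=1000;                         % number of trials
x=rand(M,1)+rand(M,1);          % M realizations of X=U1+U2
edges=[0 0.5 1 1.5 2]';         % bin edges (assumed: the four bins of Figure 2.1)
h=zeros(4,1);
for k=1:4
   h(k,1)=sum(x>edges(k) & x<=edges(k+1));   % number of outcomes falling in bin k
end
h                               % display the four bin counts
```

If X were equally likely to be anywhere in [0, 2], the four counts would be roughly equal. Comparing the two middle counts with the two outer counts gives a quick check of the proposed theorem.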

We can build up our intuition by continuing with our experimentation. Attempting
to justify the observed occurrences of X, we might suppose that the probabilities
are higher near one because there are more ways to obtain these values. If we
contrast the values of X = 1 versus X = 2, we note that X = 2 can only be obtained
by choosing $U_1 = 1$ and $U_2 = 1$ but X = 1 can be obtained from $U_1 = U_2 = 1/2$
or $U_1 = 1/4$, $U_2 = 3/4$ or $U_1 = 3/4$, $U_2 = 1/4$, etc. We can lend credibility to this
line of reasoning by supposing that $U_1$ and $U_2$ can only take on values in the set
{0, 0.25, 0.5, 0.75, 1} and finding all values of $U_1 + U_2$. In essence, we now look at a
simpler problem in order to build up our intuition. An enumeration of the possible
values is shown in Table 2.1 along with a “histogram” in Figure 2.3. It is clear
now that the probability is highest at X = 1 because the number of combinations
of $U_1$ and $U_2$ that will yield X = 1 is highest. Hence, we have learned about what
happens when outcomes of experiments are added together by employing computer
simulation.
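
The enumeration behind this intuition-building experiment can also be carried out by the computer. The following is a minimal sketch; the names u, xvals, and counts are assumptions made for this illustration, and a small tolerance is used when comparing sums.

```matlab
u=[0 0.25 0.5 0.75 1]';             % allowed values of U1 and of U2
[U1,U2]=meshgrid(u,u);              % all 25 possible (U1,U2) pairs
X=U1(:)+U2(:);                      % the 25 sums U1+U2
xvals=[0:0.25:2]';                  % distinct possible values of X
counts=zeros(length(xvals),1);
for k=1:length(xvals)
   counts(k,1)=sum(abs(X-xvals(k))<1e-10);   % number of pairs that give this sum
end
[xvals counts]                      % each value of X next to its number of combinations
```

The counts rise from one combination at X = 0 to five at X = 1 and fall back to one at X = 2, which is the peaked shape argued for above.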

                        U2
           0.00   0.25   0.50   0.75   1.00
     0.00  0.00   0.25   0.50   0.75   1.00
     0.25  0.25   0.50   0.75   1.00   1.25
U1   0.50  0.50   0.75   1.00   1.25   1.50
     0.75  0.75   1.00   1.25   1.50   1.75
     1.00  1.00   1.25   1.50   1.75   2.00

Table 2.1: Possible values for $X = U_1 + U_2$ for intuition-building experiment.

Figure 2.3: Histogram for X for intuition-building experiment.

We can now try to extend this result to the addition of three or more experimental
outcomes via computer simulation. To do so define $X_3 = U_1 + U_2 + U_3$
and $X_4 = U_1 + U_2 + U_3 + U_4$ and repeat the simulation. A computer simulation
with M = 1000 trials produces the histograms shown in Figure 2.4. It appears to
bear out the conjecture that the most probable values are near the center of the
[0, 3] and [0, 4] intervals, respectively. Additionally, the histograms appear more like
a bell-shaped or Gaussian curve. Hence, we might now conjecture, based on these
computer simulations, that as we add more and more experimental outcomes
together, we will obtain a Gaussian-shaped histogram. This is in fact true, as will be
proven later (see central limit theorem in Chapter 15). Note that we cannot prove
this result using a computer simulation but only lend evidence to our theory.
However, the use of computer simulations indicates what we need to prove, information
that is invaluable in practice. In summary, computer simulation is a valuable tool
for lending credibility to conjectures, building intuition, and uncovering new results.
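
A rough sketch of the three- and four-term experiment follows. It is not the code used for Figure 2.4. Instead of plotting histograms it reports the fraction of outcomes within 0.5 of the center of each interval; the names x3, x4, frac3, and frac4 are assumptions of this illustration.

```matlab
rand('state',0)                 % seed the generator (legacy syntax)
M=1000;                         % number of trials
x3=sum(rand(M,3),2);            % realizations of X3=U1+U2+U3
x4=sum(rand(M,4),2);            % realizations of X4=U1+U2+U3+U4
frac3=sum(abs(x3-1.5)<=0.5)/M   % fraction of X3 outcomes in [1,2]
frac4=sum(abs(x4-2)<=0.5)/M     % fraction of X4 outcomes in [1.5,2.5]
```

If all values were equally likely, these fractions would be about 1/3 and 1/4, respectively, so noticeably larger values lend evidence to the conjecture that outcomes concentrate near the center.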

Figure 2.4: Histograms for addition of outcomes. (a) Sum of 3 U's. (b) Sum of 4 U's.



Computer  simulations  cannot  be  used  to  prove  theorems. 


In Figure 2.2, which displayed the outcomes for 1000 trials, is it possible that the
computer simulation could have produced 500 outcomes in [0, 0.5], 500 outcomes in
[1.5, 2] and no outcomes in (0.5, 1.5)? The answer is yes, although it is improbable.
It can be shown that the probability of this occurring is

$$\binom{1000}{500} \left(\frac{1}{8}\right)^{1000} \approx 2.2 \times 10^{-604}$$

(see Problem 12.27).
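
A number this small underflows ordinary floating point arithmetic, so a direct evaluation returns zero. The sketch below works with logarithms instead. It assumes that each of the two end intervals has probability 1/8 (the area of a right triangle with legs of length 0.5 in the unit square), so that the probability in question is the binomial coefficient of 1000 and 500 multiplied by $(1/8)^{1000}$; the variable names are chosen for this illustration.

```matlab
lognck=gammaln(1001)-2*gammaln(501);      % natural log of 1000!/(500! 500!)
log10p=(lognck+1000*log(1/8))/log(10);    % log10 of the probability
expo=floor(log10p)                        % power of ten
mant=10^(log10p-expo)                     % leading factor, between 1 and 10
```

The pair (mant, expo) can then be compared with the value quoted above.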


2.4  Computer  Simulation  of  Random  Phenomena 

In the previous chapter we briefly explained how to use a digital computer to
simulate a random phenomenon. We now continue that discussion in more detail. Then,
the following section applies the techniques to specific problems encountered in
probability. As before, we will distinguish between experiments that produce discrete
outcomes and those that produce continuous outcomes.

We first define a random variable X as the numerical outcome of the random
experiment. Typical examples are the number of dots on a die (discrete) or the
distance of a dart from the center of a dartboard of radius one (continuous). The
random variable X can take on the values in the set {1, 2, 3, 4, 5, 6} for the first
example and in the set {r : 0 < r < 1} for the second example. We denote
the random variable by a capital letter, say X, and its possible values by a small
letter, say $x_i$ for the discrete case and x for the continuous case. The distinction is
analogous to that between a function defined as $g(x) = x^2$ and the values y = g(x)
that g(x) can take on.

Now it is of interest to determine various properties of X. To do so we use
a computer simulation, performing many experiments and observing the outcome
for each experiment. The number of experiments, which is sometimes referred to
as the number of trials, will be denoted by M. To simulate a discrete random
variable we use rand, which generates a number at random within the (0, 1) interval
(see Appendix 2A for some MATLAB basics). Assume that in general the possible
values of X are $\{x_1, x_2, \ldots, x_N\}$ with probabilities $\{p_1, p_2, \ldots, p_N\}$. As an example,
if N = 3 we can generate M values of X by using the following code segment (which
assumes M,x1,x2,x3,p1,p2,p3 have been previously assigned):

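for i=1:M
   u=rand(1,1);               % number chosen at random in (0,1)
   if u<=p1
      x(i,1)=x1;              % selected with probability p1
   elseif u>p1 & u<=p1+p2
      x(i,1)=x2;              % selected with probability p2
   elseif u>p1+p2
      x(i,1)=x3;              % selected with probability p3
   end
end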

After this code is executed, we will have generated M values of the random variable
X. Note that the values of X so obtained are termed the outcomes or realizations
of X. The extension to any number N of possible values is immediate. For a
continuous random variable X that is Gaussian we can use the code segment:

for i=1:M
   x(i,1)=randn(1,1);
end

or equivalently x=randn(M,1). Again at the conclusion of this code segment we will
have generated M realizations of X. Later we will see how to generate realizations
of random variables whose PDFs are not Gaussian (see Section 10.9).
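
One possible way to write the extension of the discrete case to any number N of possible values is sketched below. It replaces the chain of if and elseif tests with a comparison against the cumulative sums of the probabilities. The values in xvals and p are made up for illustration, and the names cp and k are likewise assumptions.

```matlab
rand('state',0)                 % seed the generator (legacy syntax)
xvals=[1 2 3]';                 % possible values x1,...,xN (illustrative)
p=[0.2 0.3 0.5]';               % probabilities p1,...,pN (illustrative, sum to one)
M=1000;                         % number of realizations
cp=cumsum(p);                   % cumulative probabilities p1, p1+p2, ...
cp(end)=1;                      % guard against roundoff in the final sum
x=zeros(M,1);
for i=1:M
   u=rand(1,1);
   k=find(u<=cp,1);             % first index whose cumulative probability is at least u
   x(i,1)=xvals(k);             % so xvals(k) is chosen with probability p(k)
end
```

For N = 3 the test u<=cp selects exactly the same branches as the three-way code segment given earlier in this section, so the two approaches generate realizations in the same way.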

2.5  Determining  Characteristics  of  Random  Variables 

There are many ways to characterize a random variable. We have already alluded to
the probability of the outcomes in the discrete case and the PDF in the continuous
case. To be more precise consider a discrete random variable, such as that describing
the outcome of a coin toss. If we toss a coin and let X be 1 if a head is observed
and let X be 0 if a tail is observed, then the probabilities are defined to be p for
$X = x_1 = 1$ and $1 - p$ for $X = x_2 = 0$. The probability p of X = 1 can be thought
of as the relative frequency of the outcome of heads in a long succession of tosses.
Hence, to determine the probability of heads we could toss a coin a large number
of times and estimate p by the number of observed heads divided by the number
of tosses. Using a computer to simulate this experiment, we might inquire as to
the number of tosses that would be necessary to obtain an accurate estimate of the
probability of heads. Unfortunately, this is not easily answered. A practical means,
though, is to increase the number of tosses until the estimate so computed converges
to a fixed number. A computer simulation is shown in Figure 2.5 where the estimate
appears to converge to about 0.4. Indeed, the true value (that value used in the
simulation) was p = 0.4. It is also seen that the estimate of p is slightly higher
than 0.4. This is due to the slight imperfections in the random number generator
as well as computational errors. Increasing the number of trials will not improve
the results.

Figure 2.5: Estimate of probability of heads for various numbers of coin tosses.

We next describe some typical simulations that will be useful to us.
To illustrate the various simulations we will use a Gaussian random variable with
realizations generated using randn(1,1). Its PDF is shown in Figure 2.6.
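
The relative frequency estimate of p discussed above for the coin toss can be sketched as follows. This is not the code behind Figure 2.5; the number of tosses and the names x and phat are assumptions of this illustration.

```matlab
rand('state',0)                 % seed the generator (legacy syntax)
p=0.4;                          % probability of heads used in the simulation
M=2000;                         % number of tosses (illustrative)
x=double(rand(M,1)<=p);         % 1 for a head, 0 for a tail
phat=cumsum(x)./[1:M]';         % estimate of p after 1,2,...,M tosses
phat(M)                         % estimate based on all M tosses
```

Plotting phat against the toss number gives a curve of the kind described for Figure 2.5, with large fluctuations for the first few tosses that settle down as more tosses are included.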

Example  2.1  -  Probability  density  function 

A PDF may be estimated by first finding the histogram and then dividing the
number of outcomes in each bin by M, the total number of realizations, to obtain
the probability. Then to obtain the PDF $p_X(x)$ recall that the probability of X
taking on a value in an interval is found as the area under the PDF of that interval
(see Section 1.3). Thus,

$$P[a \le X \le b] = \int_a^b p_X(x)\,dx \qquad (2.1)$$

and if $a = x_0 - \Delta x/2$ and $b = x_0 + \Delta x/2$, where $\Delta x$ is small, then (2.1) becomes

$$P[x_0 - \Delta x/2 \le X \le x_0 + \Delta x/2] \approx p_X(x_0)\,\Delta x$$

and therefore the PDF at $x = x_0$ is approximately

$$p_X(x_0) \approx \frac{P[x_0 - \Delta x/2 \le X \le x_0 + \Delta x/2]}{\Delta x}.$$

Hence, we need only divide the estimated probability by the bin width $\Delta x$. Also,
note that as claimed in Chapter 1, $p_X(x)$ is seen to be the probability per unit length.
In Figure 2.7 is shown the estimated PDF for a Gaussian random variable as well
as the true PDF as given in Figure 2.6. The MATLAB code used to generate the
figure is also shown.

Figure 2.6: Gaussian probability density function.

◊


Example  2.2  —  Probability  of  an  interval 

To determine $P[a \le X \le b]$ we need only generate M realizations of X, then count
the number of outcomes that fall into the [a, b] interval and divide by M. Of course

randn('state',0)                        % seed the generator (legacy syntax)
x=randn(1000,1);                        % 1000 Gaussian realizations
bincenters=[-3.5:0.5:3.5]';             % bin centers, bin width 0.5
bins=length(bincenters);
h=zeros(bins,1);
for i=1:length(x)                       % count the outcomes in each bin
   for k=1:bins
      if x(i)>bincenters(k)-0.5/2 ...
            & x(i)<=bincenters(k)+0.5/2
         h(k,1)=h(k,1)+1;
      end
   end
end
pxest=h/(1000*0.5);                     % divide by M and by the bin width
xaxis=[-4:0.01:4]';
px=(1/sqrt(2*pi))*exp(-0.5*xaxis.^2);   % true Gaussian PDF


Figure  2.7:  Estimated  and  true  probability  density  functions. 


M should be large. In particular, if we let a = 2 and $b = \infty$, then we should obtain
the value (which must be evaluated using numerical integration)

$$P[X > 2] = \int_2^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right) dx = 0.0228$$

and therefore very few realizations can be expected to fall in this interval. The results
for an increasing number of realizations are shown in Figure 2.8. This illustrates the
problem with the simulation of small probability events. It requires a large number
of realizations to obtain accurate results. (See Problem 11.47 on how to reduce the
number of realizations required.)
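
As a hedged sketch, the estimate of P[X > 2] and its true value can be obtained side by side. The true value is computed here with the built-in complementary error function erfc, using the identity $P[X > x] = \frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$ for a zero mean, unit variance Gaussian random variable; this is a substitute for the numerical integration mentioned in the example, and the name probtrue is an assumption of this illustration.

```matlab
randn('state',0)                    % seed the generator (legacy syntax)
M=100000;                           % number of realizations
x=randn(M,1);
probest=sum(x>2)/M                  % fraction of realizations exceeding 2
probtrue=0.5*erfc(2/sqrt(2))        % right-tail probability of the Gaussian PDF
```

Rerunning with smaller M shows how few realizations actually exceed 2 and therefore how variable the estimate becomes.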

◊


Example  2.3  —  Average  value 

It is frequently important to measure characteristics of X in addition to the PDF.
For example, we might only be interested in the average or mean or expected value
of X. If the random variable is Gaussian, then from Figure 2.6 we would expect X
to be zero on the average. This conjecture is easily “verified” by using the sample
mean estimate

$$\frac{1}{M} \sum_{i=1}^{M} x_i$$

M          Estimated P[X > 2]   True P[X > 2]
100             0.0100              0.0228
1000            0.0150              0.0228
10,000          0.0244              0.0228
100,000         0.0231              0.0228

randn('state',0)          % seed the generator (legacy syntax)
M=100;count=0;
x=randn(M,1);
for i=1:M
   if x(i)>2              % count the realizations exceeding 2
      count=count+1;
   end
end
probest=count/M

Figure 2.8: Estimated and true probabilities.



of  the  mean.  The  results  are  shown  in  Figure  2.9. 


M          Estimated mean   True mean
100             0.0479          0
1000           -0.0431          0
10,000          0.0011          0
100,000         0.0032          0

randn('state',0)          % seed the generator (legacy syntax)
M=100;
meanest=0;
x=randn(M,1);
for i=1:M
   meanest=meanest+(1/M)*x(i);   % accumulate the sample mean
end
meanest

Figure 2.9: Estimated and true mean.


Example  2.4  —  A  transformed  random  variable 

One of the most important problems in probability is to determine the PDF for
a transformed random variable, i.e., one that is a function of X, say $X^2$ as an
example. This is easily accomplished by modifying the code in Figure 2.7 from
x=randn(1000,1) to x=randn(1000,1);x=x.^2;. The results are shown in Figure
2.10. Note that the shape of the PDF is completely different from the original
Gaussian shape (see Example 10.7 for the true PDF). Additionally, we can obtain
the mean of $X^2$ by using

$$\frac{1}{M} \sum_{i=1}^{M} x_i^2$$

as we did in Example 2.3. The results are shown in Figure 2.11.

Figure 2.10: Estimated PDF of $X^2$ for X Gaussian.


M          Estimated mean   True mean
100             0.7491          1
1000            0.8911          1
10,000          1.0022          1
100,000         1.0073          1

randn('state',0)          % seed the generator (legacy syntax)
M=100;
meanest=0;
x=randn(M,1);
for i=1:M
   meanest=meanest+(1/M)*x(i)^2;   % accumulate the sample mean of X^2
end
meanest

Figure 2.11: Estimated and true mean.
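
The loops used for the sample means of X and of $X^2$ can be written more compactly with vector operations. The sketch below is one way to do so; it uses a single large M rather than the several values of M in the tables, and the name meansqest is an assumption of this illustration.

```matlab
randn('state',0)              % seed the generator (legacy syntax)
M=100000;                     % number of realizations
x=randn(M,1);
meanest=sum(x)/M              % sample mean of X (true mean is 0)
meansqest=sum(x.^2)/M         % sample mean of X^2 (true mean is 1)
```

The element by element operator .^ squares every realization at once, which replaces the explicit loop over i.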


Example  2.5  —  Multiple  random  variables 

Consider an experiment that yields two random variables or the vector random
variable $[X_1\ X_2]^T$, where T denotes the transpose. An example might be the choice
of a point in the square {(x, y) : 0 < x < 1, 0 < y < 1} according to some procedure.
This procedure may or may not cause the value of $x_2$ to depend on the value of
$x_1$. For example, if the result of many repetitions of this experiment produced an
even distribution of points indicated by the shaded region in Figure 2.12a, then we
would say that there is no dependency between $X_1$ and $X_2$. On the other hand, if
the points were evenly distributed within the shaded region shown in Figure 2.12b,
then there is a strong dependency. This is because if, for example, $x_1 = 0.5$, then
$x_2$ would have to lie in the interval [0.25, 0.75]. Consider next the random vector

Figure 2.12: Relationships between random variables. (a) No dependency. (b) Dependency.



$$\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix}$$

where each $U_i$ is generated using rand. The result of M = 1000 realizations is shown
in Figure 2.13a. We say that the random variables $X_1$ and $X_2$ are independent. Of
course, this is what we expect from a good random number generator. If instead,
we defined the new random variables,

$$\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} U_1 \\ \frac{1}{2}U_1 + \frac{1}{2}U_2 \end{bmatrix}$$

then from the plot shown in Figure 2.13b, we would say that the random variables
are dependent. Note that this type of plot is called a scatter diagram.
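
A minimal sketch of how the two sets of realizations might be generated is given below. Rather than relying on a plot, it checks the dependency numerically by looking only at realizations whose first component is within 0.05 of 0.5. The names x1, x2, y1, y2, near, rangeind, and rangedep are assumptions of this illustration.

```matlab
rand('state',0)                 % seed the generator (legacy syntax)
M=1000;                         % number of realizations
u1=rand(M,1);
u2=rand(M,1);
x1=u1; x2=u2;                   % first pair: X1=U1, X2=U2
y1=u1; y2=0.5*u1+0.5*u2;        % second pair: X1=U1, X2=U1/2+U2/2
near=abs(u1-0.5)<0.05;          % realizations with first component close to 0.5
rangeind=[min(x2(near)) max(x2(near))]   % spread of X2 for the first pair
rangedep=[min(y2(near)) max(y2(near))]   % spread of X2 for the second pair
% plot(x1,x2,'.') and plot(y1,y2,'.') draw the corresponding scatter diagrams
```

For the first pair the second component still ranges over nearly all of (0, 1), while for the second pair it is confined to roughly the interval [0.25, 0.75], in line with the discussion of Figure 2.12b.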

◊


2.6  Real-World  Example  -  Digital  Communications 

In a phase-shift keyed (PSK) digital communication system a binary digit (also
termed a bit), which is either a “0” or a “1”, is communicated to a receiver by
sending either $s_0(t) = A\cos(2\pi F_0 t + \pi)$ to represent a “0” or $s_1(t) = A\cos(2\pi F_0 t)$
to represent a “1”, where A > 0 [Proakis 1989]. The receiver that is used to decode
the transmission is shown in Figure 2.14. The input to the receiver is the noise

Figure 2.13: Relationships between random variables. (a) No dependency. (b) Dependency.

Figure 2.14: Receiver for a PSK digital communication system.



corrupted signal or $x(t) = s_i(t) + w(t)$, where w(t) represents the channel noise.
Ignoring the effect of noise for the moment, the output of the multiplier will be

$$s_0(t)\cos(2\pi F_0 t) = A\cos(2\pi F_0 t + \pi)\cos(2\pi F_0 t) = -A\left(\frac{1}{2} + \frac{1}{2}\cos(4\pi F_0 t)\right)$$

$$s_1(t)\cos(2\pi F_0 t) = A\cos(2\pi F_0 t)\cos(2\pi F_0 t) = A\left(\frac{1}{2} + \frac{1}{2}\cos(4\pi F_0 t)\right)$$

for a 0 and 1 sent, respectively. After the lowpass filter, which filters out the
$\cos(4\pi F_0 t)$ part of the signal, and sampler, we have

$$\xi = \begin{cases} -\frac{A}{2} & \text{for a 0} \\ \frac{A}{2} & \text{for a 1.} \end{cases}$$

The receiver decides a 1 was transmitted if $\xi > 0$ and a 0 if $\xi \le 0$. To model the
channel noise we assume that the actual value of $\xi$ observed is

$$\xi = \begin{cases} -\frac{A}{2} + W & \text{for a 0} \\ \frac{A}{2} + W & \text{for a 1} \end{cases}$$
where W is a Gaussian random variable. It is now of interest to determine how
the error depends on the signal amplitude A. Consider the case of a 1 having been
transmitted. Intuitively, if A is a large positive amplitude, then the chance that the
noise will cause an error or equivalently, $\xi \le 0$, should be small. This probability,
termed the probability of error and denoted by $P_e$, is given by $P[A/2 + W \le 0]$.
Using a computer simulation we can plot $P_e$ versus A with the result shown in Figure
2.15. Also, the true $P_e$ is shown. (In Example 10.3 we will see how to analytically
determine this probability.) As expected, the probability of error decreases as the
signal amplitude increases.

Figure 2.15: Probability of error for a PSK communication system.

With this information we can design our system by
choosing A to satisfy a given probability of error requirement. In actual systems
this requirement is usually about $P_e = 10^{-7}$. Simulating this small probability
would be exceedingly difficult due to the large number of trials required (but see
also Problem 11.47). The MATLAB code used for the simulation is given in Figure
2.16.
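
For comparison with the simulated curve, the true probability of error can be evaluated numerically. The sketch below assumes that W is the zero mean, unit variance Gaussian random variable produced by randn, for which $P[W \le -A/2] = \frac{1}{2}\,\mathrm{erfc}\!\left(A/(2\sqrt{2})\right)$. The vectorized inner step and the names Petrue and maxdiff are assumptions of this illustration rather than the code of Figure 2.16.

```matlab
randn('state',0)                    % seed the generator (legacy syntax)
A=[0.1:0.1:5]';                     % signal amplitudes
M=1000;                             % trials per amplitude
Pe=zeros(length(A),1);
for k=1:length(A)
   w=randn(M,1);                    % M noise realizations
   Pe(k,1)=sum(A(k)/2+w<=0)/M;      % fraction of trials in which a 1 is decoded as a 0
end
Petrue=0.5*erfc(A/(2*sqrt(2)));     % P[W<=-A/2] for unit variance Gaussian W
maxdiff=max(abs(Pe-Petrue))         % largest gap between simulated and true values
```

With only 1000 trials per amplitude the simulated values at the larger amplitudes are often exactly zero, which illustrates why very small error probabilities are hard to estimate by direct simulation.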

References 

Problems

Note: All the following problems require the use of a computer simulation. A
realization of a uniform random variable is obtained by using rand(1,1) while a
realization of a Gaussian random variable is obtained by using randn(1,1).

A=[0.1:0.1:5]';                  % signal amplitudes
for k=1:length(A)
   nerrors=0;                    % error count (renamed to avoid shadowing the built-in error)
   for i=1:1000
      w=randn(1,1);              % noise realization
      if A(k)/2+w<=0             % a 1 was sent but a 0 is decided
         nerrors=nerrors+1;
      end
   end
   Pe(k,1)=nerrors/1000;
end

Figure 2.16: MATLAB code used to estimate the probability of error $P_e$ in Figure
2.15.

2.1 (o) (c) An experiment consists of tossing a fair coin twice. If a head occurs
on the first toss, we let $x_1 = 1$ and if a tail occurs we let $x_1 = 0$. The
same assignment is used for the outcome $x_2$ of the second toss. Defining the
random variable as $Y = X_1 X_2$, estimate the probabilities for the different
possible values of Y. Explain your results.

2.2 (c) A pair of fair dice is tossed. Estimate the probability of “snake eyes” or a
one for each die.

2.3 (o) (c) Estimate $P[-1 \le X \le 1]$ if X is a Gaussian random variable. Verify
the results of your computer simulation by numerically evaluating the integral

$$\int_{-1}^{1} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right) dx.$$


Hint:  See  Problem  1.14. 

2.4  (c)  Estimate  the  PDF  of  the  random  variable 

i= 1  v  7 

where $U_i$ is a uniform random variable. Then, compare this PDF to the
Gaussian PDF or


Px{x)  = 

2.5 (c) Estimate the PDF of $X = U_1 - U_2$, where $U_1$ and $U_2$ are uniform random
variables. What is the most probable range of values?

2.6 (^) (c) Estimate the PDF of $X = U_1 U_2$, where $U_1$ and $U_2$ are uniform random
variables. What is the most probable range of values?

2.7 (c) Generate realizations of a discrete random variable X, which takes on values
1, 2, and 3 with probabilities $p_1 = 0.1$, $p_2 = 0.2$ and $p_3 = 0.7$, respectively.
Next based on the generated realizations estimate the probabilities of obtaining
the various values of X.

2.8 (o) (c) Estimate the mean of U, where U is a uniform random variable. What
is the true value?

2.9 (c) Estimate the mean of X + 1, where X is a Gaussian random variable. What
is the true value?

2.10 (c) Estimate the mean of $X^2$, where X is a Gaussian random variable.

2.11 (^) (c) Estimate the mean of 2U, where U is a uniform random variable.
What is the true value?

2.12 (c) It is conjectured that if $X_1$ and $X_2$ are Gaussian random variables, then
by subtracting them (let $Y = X_1 - X_2$), the probable range of values should
be smaller. Is this true?

2.13  (o)  (c)  A  large  circular  dartboard  is  set  up  with  a  “bullseye”  at  the  center  of 
the  circle,  which  is  at  the  coordinate  (0, 0).  A  dart  is  thrown  at  the  center  but 
lands  at  (X,  Y),  where  X  and  Y  are  two  different  Gaussian  random  variables. 
What  is  the  average  distance  of  the  dart  from  the  bullseye? 

2.14 (^) (c) It is conjectured that the mean of $\sqrt{U}$, where U is a uniform random
variable, is $\sqrt{\text{mean of } U}$. Is this true?

2.15 (c) The Gaussian random variables $X_1$ and $X_2$ are linearly transformed to the
new random variables

$$Y_1 = X_1 + 0.1 X_2$$
$$Y_2 = X_1 + 0.2 X_2.$$

Plot a scatter diagram for $Y_1$ and $Y_2$. Could you approximately determine the
value of $Y_2$ if you knew that $Y_1 = 1$?

2.16 (c,w) Generate a scatter diagram for the linearly transformed random
variables

$$\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} U_1 \\ U_1 + U_2 \end{bmatrix}$$

where $U_1$ and $U_2$ are uniform random variables. Can you explain why the
scatter diagram looks like a parallelogram? Hint: Define the vectors

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$

and express $\mathbf{X}$ as a linear combination of $\mathbf{e}_1$ and $\mathbf{e}_2$.

Appendix  2A 

Brief  Introduction  to  MATLAB 


A  brief  introduction  to  the  scientific  software  package  MATLAB  is  contained  in  this 
appendix.  Further  information  is  available  at  the  Web  site  www.mathworks.com. 
MATLAB  is  a  scientific  computation  and  data  presentation  language. 


Overview  of  MATLAB 


The chief advantage of MATLAB is its use of high-level instructions for matrix algebra
and built-in routines for data processing. In this appendix as well as throughout
the text a MATLAB command is indicated with the typewriter font such as end.
MATLAB treats matrices of any size (which includes vectors and scalars as special
cases) as elements and hence matrix multiplication is as simple as C=A*B, where
A and B are conformable matrices. In addition to the usual matrix operations of
addition C=A+B, multiplication C=A*B, and scaling by a constant c as B=c*A, certain
matrix operators are defined that allow convenient manipulation. For example, assume
we first define the column vector $x = [1\ 2\ 3\ 4]^T$, where T denotes transpose, by
using x=[1:4]'. The vector starts with the element 1 and ends with the element
4 and the colon indicates that the intervening elements are found by incrementing
the start value by one, which is the default. For other increments, say 0.5, we use
x=[1:0.5:4]'. To define the vector $y = [1^2\ 2^2\ 3^2\ 4^2]^T$, we can use the matrix
element by element exponentiation operator .^ to form y=x.^2 if x=[1:4]'. Similarly,
the operators .* and ./ perform element by element multiplication and division of
the matrices, respectively. For example, if


$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$
Character   Meaning

+           addition (scalars, vectors, matrices)
-           subtraction (scalars, vectors, matrices)
*           multiplication (scalars, vectors, matrices)
/           division (scalars)
^           exponentiation (scalars, square matrices)
.*          element by element multiplication
./          element by element division
.^          element by element exponentiation
;           suppress printed output of operation
:           specify intervening values
'           conjugate transpose (transpose for real vectors, matrices)
...         line continuation (when command must be split)
%           remainder of line interpreted as comment
==          logical equals
|           logical or
&           logical and
~           logical not

Table 2A.1: Definition of common MATLAB characters.


then the statements C=A.*B and D=A./B produce the results

$$C = \begin{bmatrix} 1 & 4 \\ 9 & 16 \end{bmatrix}, \qquad D = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$$


respectively.  A  listing  of  some  common  characters  is  given  in  Table  2A.1.  MATLAB 
has  the  usual  built-in  functions  of  cos,  sin,  etc.  for  the  trigonometric  functions, 
sqrt  for  a  square  root,  exp  for  the  exponential  function,  and  abs  for  absolute  value, 
as  well  as  many  others.  When  a  function  is  applied  to  a  matrix,  the  function  is 
applied  to  each  element  of  the  matrix.  Other  built-in  symbols  and  functions  and 
their meanings are given in Table 2A.2.

Matrices and vectors are easily specified. For example, to define the column
vector $c_1 = [1\ 9]^T$, just use c1=[1 9].' or equivalently c1=[1;9]. To define the C
matrix given previously, the construction C=[1 4;9 16] is used. Or we could first
define $c_2 = [4\ 16]^T$ by c2=[4 16].' and then use C=[c1 c2]. It is also possible
to extract portions of matrices to yield smaller matrices or vectors. For example,
to extract the first column from the matrix C use c1=C(:,1). The colon indicates
that all elements in the first column should be extracted. Many other convenient
manipulations of matrices and vectors are possible.
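The operations described so far can be tried directly at the command line. The following is a minimal sketch, not taken from the text, that collects them in one place; the variable names xh, P, and Crebuilt are our own choices for illustration.

```matlab
% Sketch of the colon, element-by-element, and indexing operations.
x = (1:4)';            % column vector [1 2 3 4]'
xh = (1:0.5:4)';       % increment of 0.5: 1, 1.5, ..., 4 (7 elements)
y = x.^2;              % element-by-element squaring: [1 4 9 16]'
A = [1 2; 3 4];
B = [1 2; 3 4];
C = A.*B;              % element-by-element product: [1 4; 9 16]
D = A./B;              % element-by-element quotient: all ones
P = A*B;               % ordinary matrix product: [7 10; 15 22]
c1 = C(:,1);           % first column of C
c2 = C(:,2);           % second column of C
Crebuilt = [c1 c2];    % reassemble C from its columns
```

Comparing C with P shows the difference between the element by element operator .* and ordinary matrix multiplication.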
Function          Meaning

pi                $\pi$
i                 $\sqrt{-1}$
j                 $\sqrt{-1}$
round(x)          rounds every element in x to the nearest integer
floor(x)          replaces every element in x by the nearest integer less than
                  or equal to x
inv(A)            takes the inverse of the square matrix A
x=zeros(N,1)      assigns an N x 1 vector of all zeros to x
x=ones(N,1)       assigns an N x 1 vector of all ones to x
x=rand(N,1)       generates an N x 1 vector of all uniform random variables
x=randn(N,1)      generates an N x 1 vector of all Gaussian random variables
rand('state',0)   initializes uniform random number generator
randn('state',0)  initializes Gaussian random number generator
M=length(x)       sets M equal to N if x is N x 1
sum(x)            sums all elements in vector x
mean(x)           computes the sample mean of the elements in x
flipud(x)         flips the vector x upside down
abs(x)            takes the absolute value (or complex magnitude) of every
                  element of x
fft(x,N)          computes the FFT of length N of x (zero pads if
                  N>length(x))
ifft(x,N)         computes the inverse FFT of length N of x
fftshift(x)       interchanges the two halves of an FFT output
pause             pauses the execution of a program
break             terminates a loop when encountered
whos              lists all variables and their attributes in current workspace
help              provides help on commands, e.g., help sqrt

Table 2A.2: Definition of useful MATLAB symbols and functions.
Any vector that is generated whose dimensions are not explicitly specified is
assumed to be a row vector. For example, if we say x=1:10, then it will be
designated as the 1 x 10 row vector with elements 1, 2, ..., 10. Note that
x=ones(10) produces a 10 x 10 matrix of all ones, so to yield a row vector of
ones use x=ones(1,10) and to yield a column vector use x=ones(10,1).

Loops  are  implemented  with  the  construction 

for k=1:10
   x(k,1)=1;
end

which is equivalent to x=ones(10,1). Logical flow can be accomplished with the
construction 


if x>0
   y=sqrt(x);
else
   y=0;
end

Finally, a good practice is to begin each program or script, which is called an “m”
file (due to its file name extension, for example, pdf.m), with a clear all command.
This will clear all variables in the workspace, since otherwise the current program may
inadvertently (on the part of the programmer) use previously stored variable data.

Plotting  in  MATLAB 


Plotting  in  MATLAB  is  illustrated  in  the  next  section  by  example.  Some  useful 
functions  are  summarized  in  Table  2A.3. 


Function               Meaning

figure                 opens up a new figure window
plot(x,y)              plots the elements of y versus the elements of x
plot(x1,y1,x2,y2)      same as above except multiple plots are made
plot(x,y,'.')          same as plot except the points are not connected
title('my plot')       puts a title on the plot
xlabel('x')            labels the x axis
ylabel('y')            labels the y axis
grid                   draws grid on the plot
axis([0 1 2 4])        plots only the points in range $0 \leq x \leq 1$ and $2 \leq y \leq 4$
text(1,1,'curve 1')    places the text “curve 1” at the point (1,1)
hold on                holds current plot
hold off               releases current plot

Table 2A.3: Definition of useful MATLAB plotting functions.
An Example Program

A complete MATLAB program is given below to illustrate how one might compute
the samples of several sinusoids of different amplitudes. It also allows the sinusoids
to be clipped. The sinusoid is $s(t) = A\cos(2\pi F_0 t + \pi/3)$, with A = 1, A = 2, and
A = 4, $F_0 = 1$, and t = 0, 0.01, 0.02, ..., 10. The clipping level is set at ±3, i.e., any
sample above +3 is clipped to +3 and any sample less than −3 is clipped to −3.

% matlabexample.m
%
% This program computes and plots samples of a sinusoid
% with amplitudes 1, 2, and 4. If desired, the sinusoid can be
% clipped to simulate the effect of a limiting device.
% The frequency is 1 Hz and the time duration is 10 seconds.
% The sample interval is 0.01 seconds. The code is not efficient but
% is meant to illustrate MATLAB statements.
%
clear all                  % clear all variables from workspace
delt=0.01;                 % set sampling time interval
F0=1;                      % set frequency
t=[0:delt:10]';            % compute time samples 0,0.01,0.02,...,10
A=[1 2 4]';                % set amplitudes
clip='yes';                % set option to clip
for i=1:length(A)          % begin computation of sinusoid samples
   s(:,i)=A(i)*cos(2*pi*F0*t+pi/3);  % note that samples for sinusoid
                                     % are computed all at once and
                                     % stored as columns in a matrix
   if strcmp(clip,'yes')   % determine if clipping desired (strcmp also
                           % works when the strings differ in length)
      for k=1:length(s(:,i))  % note that number of samples given as
                              % dimension of column using length command
         if s(k,i)>3       % check to see if sinusoid sample exceeds 3
            s(k,i)=3;      % if yes, then clip
         elseif s(k,i)<-3  % check to see if sinusoid sample is less
            s(k,i)=-3;     % than -3; if yes, then clip
         end
      end
   end
end
figure                     % open up a new figure window
plot(t,s(:,1),t,s(:,2),t,s(:,3))  % plot sinusoid samples versus time
                                  % for all three sinusoids
grid                       % add grid to plot
xlabel('time, t')          % label x-axis
ylabel('s(t)')             % label y-axis
axis([0 10 -4 4])          % set up axes using axis([xmin xmax ymin ymax])
legend('A=1','A=2','A=4')  % display a legend to distinguish
                           % different sinusoids

The  output  of  the  program  is  shown  in  Figure  2A.1.  Note  that  the  different  graphs 
will  appear  as  different  colors. 
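Since the program is stated to be inefficient, it may help to see how the same samples could be computed without loops. The following is only a sketch under the same parameter values; the variable names level and sclip are our own and do not appear in the program, and the plotting commands are omitted.

```matlab
% Vectorized sketch of the sinusoid computation and clipping step.
delt = 0.01;                          % sampling interval in seconds
F0 = 1;                               % frequency in Hz
t = (0:delt:10)';                     % time samples as a column vector
A = [1 2 4];                          % amplitudes as a row vector
s = cos(2*pi*F0*t + pi/3)*A;          % column times row: one column per amplitude
level = 3;                            % clipping level (hypothetical name)
sclip = min(max(s, -level), level);   % limit every sample to [-level, level]
```

Here max raises any sample below −3 to −3 and min lowers any sample above +3 to +3, so the two nested loops of the program are replaced by a single statement.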


Figure 2A.1: Output of MATLAB program matlabexample.m.


Chapter  3 


Basic  Probability 


3.1  Introduction 

We  now  begin  the  formal  study  of  probability.  We  do  so  by  utilizing  the  properties 
of  sets  in  conjunction  with  the  axiomatic  approach  to  probability.  In  particular,  we 
will  see  how  to  solve  a  class  of  probability  problems  via  counting  methods.  These 
are  problems  such  as  determining  the  probability  of  obtaining  a  royal  flush  in  poker 
or  of  obtaining  a  defective  item  from  a  batch  of  mostly  good  items,  as  examples. 
Furthermore,  the  axiomatic  approach  will  provide  the  basis  for  all  our  further  studies 
of  probability.  Only  the  methods  of  determining  the  probabilities  will  have  to  be 
modified  in  accordance  with  the  problem  at  hand. 

3.2  Summary 

Section  3.3  reviews  set  theory,  with  Figure  3.1  illustrating  the  standard  definitions. 
Manipulation  of  sets  can  be  facilitated  using  De  Morgan’s  laws  of  (3.6)  and  (3.7). 
The  application  of  set  theory  to  probability  is  summarized  in  Table  3.1.  Using  the 
three  axioms  described  in  Section  3.4  a  theory  of  probability  can  be  formulated 
and  a  means  for  computing  probabilities  constructed.  Properties  of  the  probability 
function  are  given  in  Section  3.5.  In  addition,  the  probability  for  a  union  of  three 
events  is  given  by  (3.20).  An  equally  likely  probability  assignment  for  a  continuous 
sample  space  is  given  by  (3.22)  and  is  shown  to  satisfy  the  basic  axioms.  Section  3.7 
introduces  the  determination  of  probabilities  for  discrete  sample  spaces  with  equally 
likely  outcomes.  The  basic  formula  is  given  by  (3.24).  To  implement  this  approach 
for  more  complicated  problems  in  which  brute-force  counting  of  outcomes  is  not 
possible,  the  subject  of  combinatorics  is  described  in  Section  3.8.  Permutations  and 
combinations are defined and applied to several examples for computing probabilities.
Based on these counting methods the hypergeometric probability law of (3.27)
and  the  binomial  probability  law  of  (3.28)  are  derived  in  Section  3.9.  Finally,  an 
example  of  the  application  of  the  binomial  law  to  a  quality  control  problem  is  given 
in  Section  3.10. 
3.3 Review of Set Theory

The  reader  has  undoubtedly  been  introduced  to  set  theory  at  some  point  in  his/her 
education.  We  now  summarize  only  the  salient  definitions  and  properties  that  are 
germane  to  probability.  A  set  is  defined  as  a  collection  of  objects,  for  example, 
the  set  of  students  in  a  probability  class.  The  set  A  can  be  defined  either  by  the 
enumeration  method,  i.e.,  a  listing  of  the  students  as 

A  =  {Jane,  Bill,  Jessica,  Fred}  (3.1) 

or  by  the  description  method 

A  =  {students:  each  student  is  enrolled  in  the  probability  class} 

where the “:” is read as “such that”. Another example would be the set of natural
numbers or

$$B = \{1, 2, 3, \ldots\} \quad \text{(enumeration)} \qquad (3.2)$$

$$B = \{I : I \text{ is an integer and } I \geq 1\} \quad \text{(description).}$$

Each  object  in  the  set  is  called  an  element  and  each  element  is  distinct.  For  example, 
the  sets  {1, 2,  3}  and  {1, 2, 1, 3}  are  equivalent.  There  is  no  reason  to  list  an  element 
in  a  set  more  than  once.  Likewise,  the  ordering  of  the  elements  within  the  set 
is  not  important.  The  sets  {1,2,3}  and  {2,1,3}  are  equivalent.  Sets  are  said  to 
be equal if they contain the same elements. For example, if $C_1 = \{\text{Bill}, \text{Fred}\}$
and $C_2 = \{\text{male members in the probability class}\}$, then $C_1 = C_2$. Although the
description may change, it is ultimately the contents of the set that is of importance.
An element x of a set A is denoted using the symbolism $x \in A$, and is read as “x
is contained in A”, as for example, $1 \in B$ for the set B defined in (3.2). Some sets
have no elements. If the instructor in the probability class does not give out any
grades of “A”, then the set of students receiving an “A” is $D = \{\ \}$. This is called
the empty set or the null set. It is denoted by $\emptyset$ so that $D = \emptyset$. On the other hand,
the instructor may be an easy grader and give out all “A”s. Then, we say that
$D = \mathcal{S}$, where $\mathcal{S}$ is called the universal set or the set of all students enrolled in the
probability  class.  These  concepts,  in  addition  to  some  others,  are  further  illustrated 
in  the  next  example. 

Example  3.1  -  Set  concepts 

Consider  the  set  of  all  outcomes  of  a  tossed  die.  This  is 

A  =  {1,2, 3, 4, 5, 6}.  (3.3) 

The numbers 1, 2, 3, 4, 5, 6 are its elements, which are distinct. The set of integer
numbers from 1 to 6 or $B = \{I : 1 \leq I \leq 6\}$ is equal to A. The set A is also
the universal set $\mathcal{S}$ since it contains all the outcomes. This is in contrast to the set
$C = \{2, 4, 6\}$, which contains only the even outcomes. The set C is called a subset
of A. A simple set is a set containing a single element, as for example, $C = \{1\}$.

◊


Element  vs.  simple  set 


In  the  example  of  the  probability  class  consider  the  set  of  instructors.  Usually, 
there  is  only  one  instructor  and  so  the  set  of  instructors  can  be  defined  as  the 
simple  set  A  =  {Professor  Laplace}.  However,  this  is  not  the  same  as  the  “element” 
given  by  Professor  Laplace.  A  distinction  is  therefore  made  between  the  instructors 
teaching  probability  and  an  individual  instructor.  As  another  example,  it  is  clear 
that  sometimes  elements  in  a  set  can  be  added,  as,  for  example,  2  +  3  =  5,  but  it 
makes  no  sense  to  add  sets  as  in  {2}  +  {3}  =  {5}. 

A 

More formally, a set B is defined as a subset of a set A if every element in B is also
an element of A. We write this as $B \subset A$. This also includes the case of B = A. In
fact, we can say that A = B if $A \subset B$ and $B \subset A$.

Besides subsets, new sets may be derived from other sets in a number of ways. If
$\mathcal{S} = \{x : -\infty < x < \infty\}$ (called the set of real numbers), then $A = \{x : 0 < x < 2\}$
is clearly a subset of $\mathcal{S}$. The complement of A, denoted by $A^c$, is the set of elements
in $\mathcal{S}$ but not in A. This is $A^c = \{x : x \leq 0 \text{ or } x \geq 2\}$. Two sets can be combined
together to form a new set. For example, if


A  =  {x  :  0  <  x  <  2} 

B  =  {x  :  1  <  x  <  3}  (3.4) 

then the union of A and B, denoted by $A \cup B$, is the set of elements that belong to
A or B or both A and B (so-called inclusive or). Hence, $A \cup B = \{x : 0 < x < 3\}$.
This definition may be extended to multiple sets $A_1, A_2, \ldots, A_N$ so that the union
is the set of elements for which each element belongs to at least one of these sets.
It is denoted by

$$A_1 \cup A_2 \cup \cdots \cup A_N = \bigcup_{i=1}^{N} A_i.$$

The intersection of sets A and B, denoted by $A \cap B$, is defined as the set of elements
that belong to both A and B. Hence, $A \cap B = \{x : 1 < x < 2\}$ for the sets of (3.4).
We will sometimes use the shortened symbolism AB to denote $A \cap B$. This definition
may be extended to multiple sets $A_1, A_2, \ldots, A_N$ so that the intersection is the set
of elements for which each element belongs to all of these sets. It is denoted by

$$A_1 \cap A_2 \cap \cdots \cap A_N = \bigcap_{i=1}^{N} A_i.$$

The difference between sets, denoted by $A - B$, is the set of elements in A but not
in B. Hence, for the sets of (3.4) $A - B = \{x : 0 < x \leq 1\}$. These concepts can
be illustrated pictorially using a Venn diagram as shown in Figure 3.1. The darkly
shaded regions are the sets described. The dashed portions are not included in the
sets.

[Figure 3.1 panels: (a) Universal set $\mathcal{S}$, (b) Set $A$, (c) Set $A^c$, (d) Set $A \cup B$, (e) Set $A \cap B$, (f) Set $A - B$]

Figure 3.1: Illustration of set definitions - darkly shaded region indicates the set.

A Venn diagram is useful for visualizing set operations. As an example, one
might inquire whether the sets $A - B$ and $A \cap B^c$ are equivalent or if

$$A - B = A \cap B^c. \qquad (3.5)$$

Figure 3.2: Using Venn diagrams to “validate” set relationships.

From Figures 3.2 and 3.1f we see that they appear to be. However, to formally
prove that this relationship is true requires one to let $C = A - B$, $D = A \cap B^c$ and
prove that (a) $C \subset D$ and (b) $D \subset C$. To prove (a) assume that $x \in A - B$. Then,
by definition of the difference set (see Figure 3.1f) $x \in A$ but x is not an element of
B. Hence, $x \in A$ and x must also be an element of $B^c$. Since $D = A \cap B^c$, x must
be an element of D. Hence, $x \in A \cap B^c$ and since this is true for every $x \in A - B$,
we have that $A - B \subset A \cap B^c$. The reader is asked to complete the proof of (b) in
Problem 3.6.
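For finite sets the relationship (3.5) can also be checked numerically. The following is a minimal sketch using the MATLAB set functions setdiff and intersect, which are not listed in Table 2A.2; the particular sets A and B are arbitrary choices for illustration. A check of this kind is only a validation for one example, much like the Venn diagram, and is not a proof.

```matlab
% Numerical check of A - B = A intersect B^c for one choice of finite sets.
S = 1:6;                      % universal set: outcomes of a die toss
A = [1 2 3 4];                % illustrative subset of S
B = [3 4 5];                  % illustrative subset of S
Bc = setdiff(S, B);           % complement of B relative to S: [1 2 6]
lhs = setdiff(A, B);          % A - B: [1 2]
rhs = intersect(A, Bc);       % A intersect B^c: [1 2]
same = isequal(lhs, rhs);     % true for this choice of A and B
```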

With  the  foregoing  set  definitions  a  number  of  results  follow.  They  will  be  useful 
in  manipulating  sets  to  allow  easier  calculation  of  probabilities.  We  now  list  these. 

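1. $(A^c)^c = A$

2. $A \cup A^c = \mathcal{S}$, $A \cap A^c = \emptyset$

3. $A \cup \emptyset = A$, $A \cap \emptyset = \emptyset$

4. $A \cup \mathcal{S} = \mathcal{S}$, $A \cap \mathcal{S} = A$

5. $\mathcal{S}^c = \emptyset$, $\emptyset^c = \mathcal{S}$.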

If two sets A and B have no elements in common, they are said to be disjoint.
The condition for being disjoint is therefore $A \cap B = \emptyset$. If, furthermore, the sets
contain between them all the elements of $\mathcal{S}$, then the sets are said to partition the
universe. This latter additional condition is that $A \cup B = \mathcal{S}$. An example of sets
that partition the universe is given in Figure 3.3. Note also that the sets A and $A^c$
are always a partitioning of $\mathcal{S}$ (why?).

Figure 3.3: Sets that partition the universal set.

More generally, mutually disjoint sets or sets
$A_1, A_2, \ldots, A_N$ for which $A_i \cap A_j = \emptyset$ for all $i \neq j$ are said to partition the universe
if $\mathcal{S} = \bigcup_{i=1}^{N} A_i$ (see also Problem 3.9 on how to construct these sets in general). For
example, the set of students enrolled in the probability class, which is defined as the
universe (although of course other universes may be defined such as the set of all
students attending the given university), is partitioned by

$A_1 = \{\text{males}\} = \{\text{Bill}, \text{Fred}\}$

$A_2 = \{\text{females}\} = \{\text{Jane}, \text{Jessica}\}$.

Algebraic rules for manipulating multiple sets, which will be useful, are

1. $A \cup B = B \cup A$
   $A \cap B = B \cap A$   commutative properties

2. $A \cup (B \cup C) = (A \cup B) \cup C$
   $A \cap (B \cap C) = (A \cap B) \cap C$   associative properties

3. $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$
   $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$   distributive properties.

Another important relationship for manipulating sets is De Morgan’s law.

[Figure 3.4 panels: (a) Set $A \cup B$, (b) Set $A^c \cap B^c$]

Figure 3.4: Illustration of De Morgan’s law.

Referring to Figure 3.4 it is obvious that

$$A \cup B = (A^c \cap B^c)^c \qquad (3.6)$$

which allows one to convert from unions to intersections. To convert from intersections
to unions we let $A = C^c$ and $B = D^c$ in (3.6) to obtain

$$C^c \cup D^c = (C \cap D)^c$$

and therefore

$$C \cap D = (C^c \cup D^c)^c. \qquad (3.7)$$

In either case we can perform the conversion by the following set of rules:

1. Change the unions to intersections and the intersections to unions ($A \cup B \Rightarrow A \cap B$)

2. Complement each set ($A \cap B \Rightarrow A^c \cap B^c$)

3. Complement the overall expression ($A^c \cap B^c \Rightarrow (A^c \cap B^c)^c$).
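De Morgan’s laws (3.6) and (3.7) can be checked on small finite sets. The sketch below is one such check, assuming the die outcomes as the universal set; the sets A and B and all variable names are our own illustrative choices, and the functions union, intersect, and setdiff are standard MATLAB set functions not listed in Table 2A.2.

```matlab
% Numerical check of De Morgan's laws (3.6) and (3.7) on finite sets.
S = 1:6;                                % universal set
A = [1 2 3];                            % illustrative set
B = [3 4];                              % illustrative set
Ac = setdiff(S, A);                     % A^c = [4 5 6]
Bc = setdiff(S, B);                     % B^c = [1 2 5 6]
AuB = union(A, B);                      % A U B = [1 2 3 4]
rhs1 = setdiff(S, intersect(Ac, Bc));   % (A^c intersect B^c)^c
AiB = intersect(A, B);                  % A intersect B = 3
rhs2 = setdiff(S, union(Ac, Bc));       % (A^c U B^c)^c
ok1 = isequal(AuB, rhs1);               % checks (3.6)
ok2 = isequal(AiB, rhs2);               % checks (3.7)
```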


Finally, we discuss the size of a set. This will be of extreme importance in assigning
probabilities. The set {2, 4, 6} is a finite set, having a finite number of elements.
The set {2, 4, 6, ...} is an infinite set, having an infinite number of elements. In
the latter case, although the set is infinite, it is said to be countably infinite. This
means that “in theory” we can count the number of elements in the set. (We do so
by pairing up each element in the set with an element in the set of natural numbers
or {1, 2, 3, ...}.) In either case, the set is said to be discrete. The set may be pictured
as points on the real line. In contrast to these sets the set $\{x : 0 < x < 1\}$ is
infinite and cannot be counted. This set is termed continuous and is pictured as a
line segment on the real line. Another example follows.


Example  3.2  -  Size  of  sets 

The sets

$A = \{\frac{1}{8}, \frac{1}{4}, \frac{1}{2}\}$   finite set - discrete
$B = \{1, \frac{1}{2}, \frac{1}{3}, \frac{1}{4}, \ldots\}$   countably infinite set - discrete
$C = \{x : 0 < x < 1\}$   infinite set - continuous

are pictured in Figure 3.5.

[Figure 3.5 panels: (a) Finite set, A, (b) Countably infinite set, B, (c) Infinite continuous set, C]

Figure 3.5: Examples of sets of different sizes.

3.4  Assigning  and  Determining  Probabilities 

In the previous section we reviewed various aspects of set theory. This is because the
concepts of sets and operations on sets provide an ideal description for a probabilistic
model and the means for determining the probabilities associated with the model.
Consider the tossing of a fair die. The possible outcomes comprise the elements
of the set $\mathcal{S} = \{1, 2, 3, 4, 5, 6\}$. Note that this set is composed of all the possible
outcomes, and as such is the universal set. In probability theory $\mathcal{S}$ is termed the
sample space and its elements s are the outcomes or sample points. At times we may
be interested in a particular outcome of the die tossing experiment. Other times we
might not be interested in a particular outcome, but whether or not the outcome
was an even number, as an example. Hence, we would inquire as to whether the
outcome was included in the set $E_{\text{even}} = \{2, 4, 6\}$. Clearly, $E_{\text{even}}$ is a subset of $\mathcal{S}$
and is termed an event. The simplest type of events are the ones that contain only
a single outcome such as $E_1 = \{1\}$, $E_2 = \{2\}$, or $E_6 = \{6\}$, as examples. These are
called simple events. Other events are $\mathcal{S}$, the sample space itself, and $\emptyset = \{\ \}$, the
set with no outcomes. These events are termed the certain event and the impossible
event, respectively. This is because the outcome of the experiment must be an
element of $\mathcal{S}$ so that $\mathcal{S}$ is certain to occur. Also, the event that does not contain any
outcomes cannot occur so that this event is impossible. Note that we are saying that
an event occurs if the outcome is an element of the defining set of that event. For
example, the event that a tossed die produces an even number occurs if it comes up
a 2 or a 4 or a 6. These numbers are just the elements of $E_{\text{even}}$. For disjoint sets such
as {1, 2} and {3, 4} an outcome cannot be in both sets simultaneously and hence
both events cannot occur. The events are then said to be mutually exclusive. It is
seen that probabilistic questions can be formulated using set theory, albeit with its
own terminology. A summary of the equivalent terms used is given in Table 3.1.


Set theory      Probability theory             Probability symbol

universe        sample space (certain event)   $\mathcal{S}$
element         outcome (sample point)         $s$
subset          event                          $E$
disjoint sets   mutually exclusive events      $E_1 \cap E_2 = \emptyset$
null set        impossible event               $\emptyset$
simple set      simple event                   $E = \{s\}$

Table 3.1: Terminology for set and probability theory.

In order to develop a theory of probability we must next assign probabilities to
events. For example, what is the probability that the tossed die will produce an
even outcome? Denoting this probability by $P[E_{\text{even}}]$, we would intuitively say that
it is 1/2 since there are 3 chances out of 6 to produce an even outcome. Note that P
is a probability function or a function that assigns a number between 0 and 1 to sets.
It is sometimes called a set function. The reader is familiar with ordinary functions
such as $g(x) = \exp(x)$, in which a number y, where $y = g(x)$, is assigned to each x
for $-\infty < x < \infty$, and where each x is a distinct number. The probability function
must assign a number to every event, or to every set. For a coin toss whose outcome
is either a head H or a tail T, all the events are $E_1 = \{H\}$, $E_2 = \{T\}$, $E_3 = \mathcal{S}$,
and $E_4 = \emptyset$. For a die toss all the events are $E_0 = \emptyset$, $E_1 = \{1\}, \ldots, E_6 = \{6\}$,
$E_{12} = \{1, 2\}, \ldots, E_{56} = \{5, 6\}, \ldots, E_{12345} = \{1, 2, 3, 4, 5\}, \ldots, E_{23456} = \{2, 3, 4, 5, 6\}$,
$E_{123456} = \{1, 2, 3, 4, 5, 6\} = \mathcal{S}$. There are a total of 64 events. In general, if the
sample space has N simple events, the total number of events is $2^N$ (see Problem
3.15). We must be able to assign probabilities to all of these. In accordance with
our intuitive notion of probability we assign a number, either zero or positive, to
each event. Hence, we require that

Axiom 1  $P[E] \geq 0$ for every event E.

Also,  since  the  die  toss  will  always  produce  an  outcome  that  is  included  in  S  = 
{1, 2,  3, 4, 5, 6}  we  should  require  that 

Axiom  2  P[S]  =  1. 

Next  we  might  inquire  as  to  the  assignment  of  a  probability  to  the  event  that  the 
die  comes  up  either  less  than  or  equal  to  2  or  equal  to  3.  Intuitively,  we  would  say 
that  it  is  3/6  since 

$$P[\{1,2\} \cup \{3\}] = P[\{1,2\}] + P[\{3\}] = \frac{2}{6} + \frac{1}{6} = \frac{1}{2}.$$

However,  we  would  not  assert  that  the  probability  of  the  die  coming  up  either  less 
than  or  equal  to  3  or  equal  to  3  is 

$$P[\{1,2,3\} \cup \{3\}] = P[\{1,2,3\}] + P[\{3\}] = \frac{3}{6} + \frac{1}{6} = \frac{4}{6}.$$

This  is  because  the  event  {1,2,3}  U  {3}  is  just  {1,2,3}  (we  should  not  count  the 
3  twice)  and  so  the  probability  should  be  1/2.  In  the  first  example,  the  events  are 
mutually  exclusive  (the  sets  are  disjoint)  while  in  the  second  example  they  are  not. 
Hence,  the  probability  of  an  event  that  is  the  union  of  two  mutually  exclusive  events 
should  be  the  sum  of  the  probabilities.  Combining  this  axiom  with  the  previous  ones 
produces  the  full  set  of  axioms,  which  we  summarize  next  for  convenience. 

Axiom 1  $P[E] \geq 0$ for every event E

Axiom  2  P[S]  =  1 

Axiom  3  P[E  U  F]  =  P[E]  +  P[F]  for  E  and  F  mutually  exclusive. 

Using induction (see Problem 3.17) the third axiom may be extended to

Axiom 3'  $P\left[\bigcup_{i=1}^{N} E_i\right] = \sum_{i=1}^{N} P[E_i]$ for all $E_i$'s mutually exclusive.

The  acceptance  of  these  axioms  as  the  basis  for  probability  is  called  the  axiomatic 
approach  to  probability.  It  is  remarkable  that  these  three  axioms,  along  with  a  fourth 
axiom  to  be  introduced  later,  are  adequate  to  formulate  the  entire  theory.  We  now 
illustrate  the  application  of  these  axioms  to  probability  calculations. 
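Before turning to the examples, the axioms can be checked numerically for the die. The sketch below assumes an equally likely assignment, enumerates all $2^6 = 64$ events by treating the integers 0 to 63 as binary membership masks (our own device, using the MATLAB function bitget, which is not listed in Table 2A.2), and then revisits the two unions discussed above. All variable names are hypothetical.

```matlab
% Enumerate every event of the die toss and check the axioms numerically.
N = 6;                               % number of outcomes for the die
p = ones(1,N)/N;                     % equally likely assignment (assumption)
numEvents = 2^N;                     % every subset of S is an event
probs = zeros(numEvents,1);
for m = 0:numEvents-1
    member = bitget(m, 1:N) == 1;    % logical mask of outcomes in this event
    probs(m+1) = sum(p(member));     % sum of the simple event probabilities
end
E = [1 2]; F = 3;                    % mutually exclusive events
PEF = sum(p(union(E, F)));           % P[E U F], equal to P[E] + P[F]
G = [1 2 3];                         % G and F are not mutually exclusive
PGF = sum(p(union(G, F)));           % union is {1,2,3}, so 3/6 and not 4/6
```

The first entry of probs corresponds to the impossible event and the last entry to the certain event.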

Example  3.3  —  Die  toss 

Determine the probability that the outcome of a fair die toss is even. The event
is $E_{\text{even}} = \{2, 4, 6\}$. The assumption that the die is fair means that each outcome
must be equally likely. Defining $E_i$ as the simple event $\{i\}$ we note that

$$\mathcal{S} = \bigcup_{i=1}^{6} E_i$$

and from Axiom 2 we must have

$$P\left[\bigcup_{i=1}^{6} E_i\right] = P[\mathcal{S}] = 1. \qquad (3.8)$$

But since each $E_i$ is a simple event and by definition the simple events are mutually
exclusive (only one outcome or simple event can occur), we have from Axiom 3' that

$$P\left[\bigcup_{i=1}^{6} E_i\right] = \sum_{i=1}^{6} P[E_i]. \qquad (3.9)$$

Next we note that the outcomes are assumed to be equally likely which means that
$P[E_1] = P[E_2] = \cdots = P[E_6] = p$. Hence, we must have from (3.8) and (3.9) that

$$\sum_{i=1}^{6} P[E_i] = 6p = 1$$

or $P[E_i] = 1/6$ for all i. We can now finally determine $P[E_{\text{even}}]$ since $E_{\text{even}} =
E_2 \cup E_4 \cup E_6$. By applying Axiom 3' once again we have

$$P[E_{\text{even}}] = P[E_2 \cup E_4 \cup E_6] = P[E_2] + P[E_4] + P[E_6] = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}.$$

◊
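The result of this example can be corroborated by a computer simulation. The following is only a sketch: the number of tosses N is an arbitrary choice, the random number generator is left unseeded, and the estimate will therefore differ slightly from run to run.

```matlab
% Monte Carlo estimate of the probability of an even outcome for a fair die.
N = 100000;                          % number of simulated tosses (arbitrary)
toss = floor(6*rand(N,1)) + 1;       % integers 1,...,6, each equally likely
Peven = mean(mod(toss, 2) == 0);     % relative frequency of an even outcome
```

For a large number of tosses the relative frequency Peven should be close to the value 1/2 obtained from the axioms.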

In general, the probabilities assigned to each simple event need not be the same,
i.e., the outcomes of a die toss may not have equal probabilities. One might have
weighted the die so that the number 6 comes up twice as often as each of the others.
The numbers 1, 2, 3, 4, 5 could still be equally likely. In such a case, since the
probabilities of all the simple events must sum to one, we would have the assignment
$P[\{i\}] = 1/7$ for i = 1, 2, 3, 4, 5 and $P[\{6\}] = 2/7$. In either case, to compute the probability
of any event it is only necessary to sum the probabilities of the simple events that
make up that event. Letting $P[\{s_i\}]$ be the probability of the ith simple event we
have that

$$P[E] = \sum_{\{i : s_i \in E\}} P[\{s_i\}]. \qquad (3.10)$$

We now simplify the notation by omitting the { } when referring to events. Instead
of P[{1}] we will use P[1]. Another example follows.

Example  3.4  —  Defective  die  toss 

A defective die is tossed whose sides have been mistakenly manufactured with the
number of dots being 1, 1, 2, 2, 3, 4. The simple events are $s_1 = 1$, $s_2 = 1$, $s_3 = 2$,
$s_4 = 2$, $s_5 = 3$, $s_6 = 4$. Even though some of the outcomes have the same number
of dots, they are actually different in that a different side is being observed. Each
side is equally likely to appear. What is the probability that the outcome is less
than 3? Noting that the event of interest is $\{s_1, s_2, s_3, s_4\}$, we use (3.10) to obtain

$$P[E] = P[\text{outcome} < 3] = \sum_{i=1}^{4} P[s_i] = \frac{4}{6}.$$

The  formula  given  by  (3.10)  also  applies  to  probability  problems  for  which  the  sample 
space  is  countably  infinite.  Therefore,  it  applies  to  all  discrete  sample  spaces  (see 
also  Example  3.2). 

Example  3.5  —  Countably  infinite  sample  space 

A habitually tardy person arrives at the theater late by $s_i$ minutes, where

$$s_i = i \qquad i = 1, 2, 3, \ldots .$$

If $P[s_i] = (1/2)^i$, what is the probability that he will be more than 1 minute late?
The event is $E = \{2, 3, 4, \ldots\}$. Using (3.10) we have

$$P[E] = \sum_{i=2}^{\infty} \left(\frac{1}{2}\right)^i.$$

Using the formula for the sum of a geometric progression (see Appendix B)

$$\sum_{i=k}^{\infty} a^i = \frac{a^k}{1 - a} \qquad \text{for } |a| < 1$$

we have that

$$P[E] = \frac{\left(\frac{1}{2}\right)^2}{1 - \frac{1}{2}} = \frac{1}{2}.$$

❖


In the above example we have implicitly used the relationship

$$P\left[\bigcup_{i=1}^{\infty} E_i\right] = \sum_{i=1}^{\infty} P[E_i] \qquad (3.11)$$

where $E_i = \{s_i\}$ and hence the $E_i$'s are mutually exclusive. This does not automatically
follow from Axiom 3' since $N$ is now infinite. However, we will assume for our
problems of interest that it does. Adding (3.11) to our list of axioms we have

Axiom 4  $P\left[\bigcup_{i=1}^{\infty} E_i\right] = \sum_{i=1}^{\infty} P[E_i]$ for all $E_i$'s mutually exclusive.

See  [Billingsley  1986]  for  further  details. 


3.5  Properties  of  the  Probability  Function 

From the four axioms we may derive many useful properties for evaluating
probabilities. We now summarize these properties.

Property  3.1  -  Probability  of  complement  event 

$$P[E^c] = 1 - P[E]. \qquad (3.12)$$

Proof: By definition $E \cup E^c = S$. Also, by definition $E$ and $E^c$ are mutually
exclusive. Hence,

$$\begin{aligned}
1 &= P[S] && \text{(Axiom 2)} \\
  &= P[E \cup E^c] && \text{(definition of complement set)} \\
  &= P[E] + P[E^c] && \text{(Axiom 3)}
\end{aligned}$$

from which (3.12) follows.

□

We could have determined the probability in Example 3.5 without the use of the
geometric progression formula by using $P[E] = 1 - P[E^c] = 1 - P[1] = 1/2$.
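The three routes to the answer of Example 3.5 can be compared numerically. The MATLAB sketch below is only an illustration: the truncation point of 60 terms and the variable names are assumptions made here, not part of the text.

```matlab
% P[s_i] = (1/2)^i for i = 1, 2, 3, ... (Example 3.5)
a = 1/2;

% Route 1: truncate the infinite sum over i = 2, 3, ... at i = 60
% (the neglected tail equals (1/2)^60, roughly 1e-18)
i = 2:60;
Psum = sum(a.^i);

% Route 2: closed form a^k/(1-a) of the geometric progression with k = 2
Pgeo = a^2/(1 - a);

% Route 3: Property 3.1, P[E] = 1 - P[E^c] = 1 - P[1]
Pcomp = 1 - a^1;
```

All three values agree with 1/2 to within floating-point accuracy.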

Property  3.2  —  Probability  of  impossible  event 

$$P[\emptyset] = 0. \qquad (3.13)$$

Proof: Since $\emptyset = S^c$ we have

$$\begin{aligned}
P[\emptyset] &= P[S^c] \\
             &= 1 - P[S] && \text{(from Property 3.1)} \\
             &= 1 - 1 && \text{(from Axiom 2)} \\
             &= 0.
\end{aligned}$$

□

We will see later that there are other events for which the probability can be zero.
Thus, the converse is not true; a probability of zero does not imply that the event
is the impossible event.

Property  3.3  -  All  probabilities  are  between  0  and  1. 

Proof:

$$\begin{aligned}
S &= E \cup E^c && \text{(definition of complement set)} \\
P[S] &= P[E] + P[E^c] && \text{(Axiom 3)} \\
1 &= P[E] + P[E^c] && \text{(Axiom 2)}
\end{aligned}$$

But from Axiom 1 $P[E^c] \geq 0$ and therefore

$$P[E] = 1 - P[E^c] \leq 1. \qquad (3.14)$$

Combining this result with Axiom 1 proves Property 3.3.

□ 

Property 3.4 - Formula for $P[E \cup F]$ where $E$ and $F$ are not mutually exclusive

$$P[E \cup F] = P[E] + P[F] - P[EF]. \qquad (3.15)$$

(We have shortened $E \cap F$ to $EF$.)

Proof: By the definition of $E - F$ we have that $E \cup F = (E - F) \cup F$ (see Figure
3.1d,f). Also, the events $E - F$ and $F$ are by definition mutually exclusive. It follows
that

$$P[E \cup F] = P[E - F] + P[F] \quad \text{(Axiom 3)}. \qquad (3.16)$$

But by definition $E = (E - F) \cup EF$ (draw a Venn diagram) and $E - F$ and $EF$
are mutually exclusive. Thus,

$$P[E] = P[E - F] + P[EF] \quad \text{(Axiom 3)}. \qquad (3.17)$$

Solving (3.17) for $P[E - F]$ and substituting into (3.16) produces Property 3.4.

□ 

The effect of this formula is to make sure that the intersection $EF$ is not counted
twice in the probability calculation. This would be the case if Axiom 3 were mistakenly
applied to sets that were not mutually exclusive. In the die example, if we
wanted the probability of the die coming up either less than or equal to 3 or equal
to 3, then we would first define

$$E = \{1, 2, 3\}, \qquad F = \{3\}$$

so that $EF = \{3\}$. Using Property 3.4, we have that

$$P[E \cup F] = P[E] + P[F] - P[EF] = \frac{3}{6} + \frac{1}{6} - \frac{1}{6} = \frac{1}{2}.$$

Of course, we could just as easily have noted that $E \cup F = \{1, 2, 3\} = E$ and then
applied (3.10). Another example follows.
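Property 3.4 can also be checked numerically for the fair die by representing events as lists of outcomes. The sketch below uses MATLAB's union and intersect functions; the variable names are illustrative assumptions.

```matlab
% Fair die: each simple event has probability 1/6
p = ones(1,6)/6;

E = [1 2 3];   % the event {1,2,3}
F = 3;         % the event {3}

EF  = intersect(E, F);   % intersection, here {3}
EuF = union(E, F);       % union, here {1,2,3}

% Left side of (3.15), computed directly from (3.10)
Pdirect = sum(p(EuF));

% Right side of (3.15)
Pformula = sum(p(E)) + sum(p(F)) - sum(p(EF));
```

Both Pdirect and Pformula evaluate to 1/2, whereas adding sum(p(E)) and sum(p(F)) without the correction term would count the outcome 3 twice.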

Example  3.6  —  Switches  in  parallel 

A switching circuit shown in Figure 3.6 consists of two potentially faulty switches in
parallel. In order for the circuit to operate properly at least one of the switches must
close to allow the overall circuit to be closed.

Figure 3.6: Parallel switching circuit.

Each switch has a probability of 1/2 of closing. The probability that both switches
close simultaneously is 1/4. What is the probability that the switching circuit will
operate correctly? To solve this problem we first define the events
$E_1$ = {switch 1 closes} and $E_2$ = {switch 2 closes}. The event that at least one
switch closes is $E_1 \cup E_2$. This includes the possibility that both switches close.
Then using Property 3.4 we have

$$P[E_1 \cup E_2] = P[E_1] + P[E_2] - P[E_1 E_2] = \frac{1}{2} + \frac{1}{2} - \frac{1}{4} = \frac{3}{4}.$$

Note  that  by  using  two  switches  in  parallel  as  opposed  to  only  one  switch,  the 
probability  that  the  circuit  will  operate  correctly  has  been  increased.  What  do  you 
think  would  happen  if  we  had  used  three  switches  in  parallel?  Or  if  we  had  used  N 
switches?  Could  you  ever  be  assured  that  the  circuit  would  operate  flawlessly?  (See 
Problem  3.26.) 

❖
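The result of Example 3.6 can be checked by simulation. The MATLAB sketch below makes one assumption that the example does not state explicitly: the two switches close independently. This is consistent with the given value P[both close] = 1/4 = (1/2)(1/2). The number of trials and the variable names are also illustrative.

```matlab
Ntrials = 100000;

% Assumption: the two switches close independently, each with probability 1/2,
% which is consistent with P[both close] = 1/4 as stated in the example
s1 = rand(Ntrials,1) < 0.5;   % true if switch 1 closes
s2 = rand(Ntrials,1) < 0.5;   % true if switch 2 closes

Pboth     = mean(s1 & s2);    % relative frequency of both closing, near 1/4
Pparallel = mean(s1 | s2);    % relative frequency of at least one closing, near 3/4
```

Because the estimates are relative frequencies, they will differ slightly from 1/4 and 3/4 from run to run.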


Property  3.5  -  Monotonicity  of  probability  function 

Monotonicity asserts that the larger the set, the larger the probability of that set.
Mathematically, this translates into the statement that if $E \subset F$, then $P[E] \leq P[F]$.

Proof: If $E \subset F$, then by definition $F = E \cup (F - E)$, where $E$ and $F - E$ are
mutually exclusive by definition. Hence,

$$\begin{aligned}
P[F] &= P[E] + P[F - E] && \text{(Axiom 3)} \\
     &\geq P[E] && \text{(Axiom 1)}.
\end{aligned}$$

□

Note that since $EF \subset F$ and $EF \subset E$, we have that $P[EF] \leq P[E]$ and also that
$P[EF] \leq P[F]$. The probability of an intersection is always less than or equal to
the probability of the set with the smallest probability.

Example  3.7  —  Switches  in  series 

A switching circuit shown in Figure 3.7 consists of two potentially faulty switches in
series. In order for the circuit to operate properly both switches must close.

Figure 3.7: Series switching circuit.

For the same switches as described in Example 3.6 what is the probability that the
circuit will operate properly? Now we need to find $P[E_1 E_2]$. This was given as 1/4 so that

$$\frac{1}{4} = P[E_1 E_2] \leq P[E_1] = \frac{1}{2}.$$

Could  the  series  circuit  ever  outperform  the  parallel  circuit?  (See  Problem  3.27.) 

❖ 

One last property that is often useful is the probability of a union of more than
two events. This extends Property 3.4. Consider first three events so that we wish
to derive a formula for $P[E_1 \cup E_2 \cup E_3]$, which is equivalent to $P[(E_1 \cup E_2) \cup E_3]$ or
$P[E_1 \cup (E_2 \cup E_3)]$ by the associative property. Writing this as $P[E_1 \cup (E_2 \cup E_3)]$,
we have

$$\begin{aligned}
P[E_1 \cup E_2 \cup E_3] &= P[E_1 \cup (E_2 \cup E_3)] \\
&= P[E_1] + P[E_2 \cup E_3] - P[E_1(E_2 \cup E_3)] && \text{(Property 3.4)} \\
&= P[E_1] + (P[E_2] + P[E_3] - P[E_2 E_3]) \\
&\quad - P[E_1(E_2 \cup E_3)] && \text{(Property 3.4)}. \qquad (3.18)
\end{aligned}$$

But $E_1(E_2 \cup E_3) = E_1 E_2 \cup E_1 E_3$ by the distributive property (draw a Venn diagram)
so that

$$\begin{aligned}
P[E_1(E_2 \cup E_3)] &= P[E_1 E_2 \cup E_1 E_3] \\
&= P[E_1 E_2] + P[E_1 E_3] - P[E_1 E_2 E_3] && \text{(Property 3.4)} \qquad (3.19)
\end{aligned}$$

where we have used $(E_1 E_2)(E_1 E_3) = E_1 E_2 E_3$. Substituting (3.19) into (3.18) produces

$$P[E_1 \cup E_2 \cup E_3] = P[E_1] + P[E_2] + P[E_3] - P[E_2 E_3] - P[E_1 E_2] - P[E_1 E_3] + P[E_1 E_2 E_3] \qquad (3.20)$$

which is the desired result. It can further be shown that (see Problem 3.29)

$$P[E_1 E_2] + P[E_1 E_3] + P[E_2 E_3] \geq P[E_1 E_2 E_3]$$

so that

$$P[E_1 \cup E_2 \cup E_3] \leq P[E_1] + P[E_2] + P[E_3] \qquad (3.21)$$

which is known as Boole's inequality or the union bound. Clearly, equality holds if
and only if the $E_i$'s are mutually exclusive. Both (3.20) and (3.21) can be extended
to any finite number of unions [Ross 2002].
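A small numerical check of (3.20) and (3.21) for the fair die is sketched below in MATLAB. The three events E1, E2, E3 are hypothetical and chosen only for illustration, as is the helper function P, which implements (3.10) for events given as lists of outcomes.

```matlab
p = ones(1,6)/6;          % fair die
P = @(A) sum(p(A));       % (3.10): probability of an event given as a list of outcomes

% Hypothetical events chosen only for illustration
E1 = [1 2 3];
E2 = [2 4 6];
E3 = [3 4 5];

E12  = intersect(E1, E2);
E13  = intersect(E1, E3);
E23  = intersect(E2, E3);
E123 = intersect(E12, E3);   % empty for this choice of events

Punion  = P(union(union(E1, E2), E3));      % direct evaluation of the union
Pincexc = P(E1) + P(E2) + P(E3) ...
          - P(E12) - P(E13) - P(E23) + P(E123);   % right side of (3.20)
Pbound  = P(E1) + P(E2) + P(E3);            % right side of (3.21)
```

For these events the union is the whole sample space, so Punion and Pincexc both equal 1, while the union bound Pbound equals 3/2 and is therefore loose.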


3.6  Probabilities  for  Continuous  Sample  Spaces 


We have introduced the axiomatic approach to probability and illustrated the approach
with examples from a discrete sample space. The axiomatic approach is
completely general and applies to continuous sample spaces as well. However, (3.10)
cannot be used to determine probabilities of events. This is because the simple events
of the continuous sample space are not countable. For example, suppose one throws
a dart at a “linear” dartboard as shown in Figure 3.8 and measures the horizontal
distance from the “bullseye” or center at $x = 0$. We will then have a sample space
$S = \{x : -1/2 \leq x \leq 1/2\}$, which is not countable.

Figure 3.8: “Linear” dartboard.

A possible approach is to assign probabilities to intervals as opposed to sample points.
If the dart is equally likely to land anywhere, then we could assign the interval $[a, b]$
a probability equal to the length of the interval or

$$P[a \leq x \leq b] = b - a \qquad -1/2 \leq a \leq b \leq 1/2. \qquad (3.22)$$

Also, we will assume that the probability of disjoint intervals is the sum of the
probabilities for each interval. This assignment is entirely consistent with our axioms
since

$$\begin{aligned}
P[E] &= P[a \leq x \leq b] = b - a \geq 0 && \text{(Axiom 1)} \\
P[S] &= P[-1/2 \leq x \leq 1/2] = 1/2 - (-1/2) = 1 && \text{(Axiom 2)} \\
P[E \cup F] &= P[a \leq x \leq b \,\cup\, c \leq x \leq d] \\
&= (b - a) + (d - c) && \text{(assumption)} \\
&= P[a \leq x \leq b] + P[c \leq x \leq d] \\
&= P[E] + P[F] && \text{(Axiom 3)}
\end{aligned}$$

for $a \leq b < c \leq d$ so that $E$ and $F$ are mutually exclusive. Hence, an equally
likely type probability assignment for a continuous sample space is a valid one and
produces a probability equal to the length of the interval. If the sample space does
not have unity length, as for example, a dartboard with a length $L$, then we should
use

$$P[a \leq x \leq b] = \frac{\text{Length of interval}}{\text{Length of dartboard}} = \frac{\text{Length of interval}}{L}. \qquad (3.23)$$


Probability  of  a  bullseye 


It is an inescapable fact that the probability of the dart landing at say $x = 0$ is
zero since the length of this interval is zero. For that matter the probability of
the dart landing at any one particular point $x_0$ is zero as follows from (3.22) with
$a = b = x_0$. The first-time reader of probability will find this particularly disturbing
and ask, “How can the probability of landing at every point be zero if indeed
the dart had to land at some point?” From a pragmatic viewpoint we will seldom be
interested in probabilities of points in a continuous sample space but only in those of
intervals. How many darts are there whose tips have width zero and so can be said
to land at a point? It is more realistic in practice then to ask for the probability that
the dart lands in the bullseye, which is a small interval with some nonzero length.
That probability is found by using (3.22). From a mathematical viewpoint it is not
possible to “sum” up an infinite number of positive numbers of equal value and not
obtain infinity, as opposed to one, as assumed in Axiom 2. The latter is true for
continuous sample spaces, in which we have an uncountably infinite set, and also
for discrete sample spaces that are composed of an infinite but countable set. (Note
that in Example 3.5 we had a countably infinite sample space but the probabilities
were not equal.)


Since the probability of a point event occurring is zero, the probability of any interval
is the same whether or not the endpoints are included. Thus, for our example

$$P[a \leq x \leq b] = P[a < x \leq b] = P[a \leq x < b] = P[a < x < b].$$
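The interval assignment (3.22) can be illustrated by simulating dart throws that are equally likely to land anywhere on the dartboard. In the MATLAB sketch below the interval endpoints a and b, the number of trials, and the variable names are assumptions chosen only for illustration.

```matlab
Ntrials = 100000;

% Dart positions equally likely anywhere on the dartboard -1/2 <= x <= 1/2
x = rand(Ntrials,1) - 0.5;

% Hypothetical interval, chosen only for illustration
a = -0.1;
b = 0.3;

Pest  = mean(x >= a & x <= b);   % relative frequency of landing in [a,b]
Ptrue = b - a;                   % probability assigned by (3.22)
```

The estimate Pest should be close to the interval length 0.4, with small run-to-run variation.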


3.7  Probabilities  for  Finite  Sample  Spaces  -  Equally 
Likely  Outcomes 

We  now  consider  in  more  detail  a  discrete  sample  space  with  a  finite  number  of 
outcomes.  Some  examples  that  we  are  already  familiar  with  are  a  coin  toss,  a  die 
toss,  or  the  students  in  a  class.  Furthermore,  we  assume  that  the  simple  events 
or  outcomes  are  equally  likely.  Many  problems  have  this  structure  and  can  be 
approached  using  counting  methods  or  combinatorics.  For  example,  if  two  dice  are 
tossed,  then  the  sample  space  is 


$$S = \{(i, j) : i = 1, \ldots, 6;\ j = 1, \ldots, 6\}$$

which consists of 36 outcomes with each outcome or simple event denoted by an
ordered pair of numbers. If we wish to assign probabilities to events, then we need
only assign probabilities to the simple events and then use (3.10). But if all the
simple events, denoted by $s_{ij}$, are equally likely, then

$$P[s_{ij}] = \frac{1}{N_S} = \frac{1}{36}$$

where $N_S$ is the number of outcomes in $S$. Now using (3.10) we have for any event
that

$$P[E] = \sum_{\{(i,j) : s_{ij} \in E\}} P[s_{ij}] = \sum_{\{(i,j) : s_{ij} \in E\}} \frac{1}{N_S} = \frac{N_E}{N_S} = \frac{\text{Number of outcomes in } E}{\text{Number of outcomes in } S}.$$

We will use combinatorics to determine $N_E$ and $N_S$ and hence $P[E]$.
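For a sample space this small the counts can also be obtained by brute-force enumeration. The MATLAB sketch below builds all 36 ordered pairs with ndgrid and counts the outcomes in an event; the event "the two dice sum to 7" is a hypothetical choice made only for illustration, as are the variable names.

```matlab
% Enumerate the sample space of ordered pairs for two dice
[d1, d2] = ndgrid(1:6, 1:6);
NS = numel(d1);                 % number of outcomes in S

% Hypothetical event chosen only for illustration: the two dice sum to 7
inE = (d1 + d2 == 7);
NE = sum(inE(:));               % number of outcomes in E

PE = NE/NS;                     % ratio of counts for equally likely outcomes
```

Here NS is 36 and NE is 6, so that PE equals 1/6.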

Example  3.8  —  Probability  of  equal  values  for  two-dice  toss 

Each outcome with equal values is of the form $(i, i)$ so that

$$P[E] = \frac{\text{Number of outcomes with } (i, i)}{\text{Total number of outcomes}}. \qquad (3.24)$$

There are 6 outcomes with equal values or $(i, i)$ for $i = 1, 2, \ldots, 6$. Thus,

$$P[E] = \frac{6}{36} = \frac{1}{6}.$$

❖


Example  3.9  —  A  more  challenging  problem  -  urns 

An urn contains 3 red balls and 2 black balls. Two balls are chosen in succession.
The first ball is returned to the urn before the second ball is chosen. Each ball is
chosen at random, which means that we are equally likely to choose any ball. What
is the probability of choosing first a red ball and then a black ball? To solve this
problem we first need to define the sample space. To do so we assign numbers to the
balls as follows. The red balls are numbered 1, 2, 3 and the black balls are numbered
4, 5. The sample space is then $S = \{(i, j) : i = 1, 2, 3, 4, 5;\ j = 1, 2, 3, 4, 5\}$. The
event of interest is $E = \{(i, j) : i = 1, 2, 3;\ j = 4, 5\}$. We assume that all the simple
events are equally likely. An enumeration of the outcomes is shown in Table 3.2.
The outcomes with the asterisks comprise $E$. Hence, the probability is $P[E] = 6/25$.

          j = 1    j = 2    j = 3    j = 4    j = 5
i = 1     (1,1)    (1,2)    (1,3)    (1,4)*   (1,5)*
i = 2     (2,1)    (2,2)    (2,3)    (2,4)*   (2,5)*
i = 3     (3,1)    (3,2)    (3,3)    (3,4)*   (3,5)*
i = 4     (4,1)    (4,2)    (4,3)    (4,4)    (4,5)
i = 5     (5,1)    (5,2)    (5,3)    (5,4)    (5,5)

Table 3.2: Enumeration of outcomes for urn problem of Example 3.9.

This problem could also have been solved using combinatorics as follows. Since there
are 5 possible choices for each ball, there are a total of $5^2 = 25$ outcomes in the
sample space. There are 3 possible ways to choose a red ball on the first draw and 2
possible ways to choose a black ball on the second draw, yielding a total of $3 \cdot 2 = 6$
possible ways of choosing a red ball followed by a black ball. We thus arrive at the
same probability.

❖ 
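The enumeration of Example 3.9 can be reproduced in a few lines of MATLAB. This is only a sketch of the counting argument; the variable names are illustrative.

```matlab
% Balls 1,2,3 are red and balls 4,5 are black; draws are made with replacement
[z1, z2] = ndgrid(1:5, 1:5);          % all 2-tuples (z1,z2)

NS = numel(z1);                       % 5^2 = 25 outcomes in the sample space
NE = sum(z1(:) <= 3 & z2(:) >= 4);    % red ball first, then black ball
PE = NE/NS;                           % 6/25
```

The count NE picks out exactly the six outcomes marked with asterisks in the enumeration of the example.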


3.8  Combinatorics 

Combinatorics is the study of counting. As illustrated in Example 3.9, we often
have an outcome that can be represented as a 2-tuple or $(z_1, z_2)$, where $z_1$ can take
on one of $N_1$ values and $z_2$ can take on one of $N_2$ values. For that example, the total
number of 2-tuples in $S$ is $N_1 N_2 = 5 \cdot 5 = 25$, while that in $E$ is $N_1 N_2 = 3 \cdot 2 = 6$, as
can be verified by referring to Table 3.2. It is important to note that order matters
in the description of a 2-tuple. For example, the 2-tuple (1,2) is not the same as the
2-tuple (2,1) since each one describes a different outcome of the experiment. We will
frequently be using 2-tuples and more generally $r$-tuples denoted by $(z_1, z_2, \ldots, z_r)$
to describe the outcomes of urn experiments.

In  drawing  balls  from  an  urn  there  are  two  possible  strategies.  One  method  is  to 
draw  a  ball,  note  which  one  it  is,  return  it  to  the  urn,  and  then  draw  a  second  ball. 
This  is  called  sampling  with  replacement  and  was  used  in  Example  3.9.  However,  it 
is  also  possible  that  the  first  ball  is  not  returned  to  the  urn  before  the  second  one  is 
chosen.  This  method  is  called  sampling  without  replacement.  The  contrast  between 
the  two  strategies  is  illustrated  next. 

Example  3.10  —  Computing  probabilities  of  drawing  balls  from  urns  - 
with  and  without  replacement 

An urn has $k$ red balls and $N - k$ black balls. If two balls are chosen in succession
and at random with replacement, what is the probability of a red ball followed by a
black ball? We solve this problem by first labeling the $k$ red balls with $1, 2, \ldots, k$
and the black balls with $k + 1, k + 2, \ldots, N$. In doing so the possible outcomes of
the experiment can be represented by a 2-tuple $(z_1, z_2)$, where $z_1 \in \{1, 2, \ldots, N\}$
and $z_2 \in \{1, 2, \ldots, N\}$. A successful outcome is a red ball followed by a black one
so that the successful event is $E = \{(z_1, z_2) : z_1 = 1, \ldots, k;\ z_2 = k + 1, \ldots, N\}$. The
total number of 2-tuples in the sample space is $N_S = N^2$, while the total number of
2-tuples in $E$ is $N_E = k(N - k)$ so that

$$P[E] = \frac{N_E}{N_S} = \frac{k(N - k)}{N^2}.$$

Note that if we let $p = k/N$ be the proportion of red balls, then $P[E] = p(1 - p)$.
Next consider the case of sampling without replacement. Now since the same ball
cannot be chosen twice in succession, and therefore, $z_1 \neq z_2$, we have one fewer
choice for the second ball. Therefore, $N_S = N(N - 1)$. As before, the number of
successful 2-tuples is $N_E = k(N - k)$, resulting in

$$P[E] = \frac{k(N - k)}{N(N - 1)} = \frac{k}{N} \, \frac{N - k}{N} \, \frac{N}{N - 1} = p(1 - p) \frac{N}{N - 1}.$$

The probability is seen to be higher. Can you explain this? (It may be helpful to
think about the effect of a successful first draw on the probability of a success on
the second draw.) Of course, for large $N$ the probabilities for sampling with and
without replacement are seen to be approximately the same, as expected.

❖
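The two formulas of Example 3.10 can be compared against a direct enumeration of 2-tuples. In the MATLAB sketch below the urn size N = 10 and the number of red balls k = 4 are hypothetical values chosen only for illustration, and the variable names are likewise assumptions.

```matlab
N = 10;   % hypothetical urn size
k = 4;    % hypothetical number of red balls (labeled 1,...,k)
p = k/N;  % proportion of red balls

[z1, z2] = ndgrid(1:N, 1:N);
success = (z1 <= k) & (z2 > k);     % red ball followed by black ball

% With replacement: all N^2 2-tuples are possible
Pwith = sum(success(:))/numel(z1);

% Without replacement: 2-tuples with z1 = z2 are excluded
valid = (z1 ~= z2);
Pwithout = sum(success(:) & valid(:))/sum(valid(:));

% Closed forms from the example
Pwith_formula    = p*(1 - p);
Pwithout_formula = p*(1 - p)*N/(N - 1);
```

For these values the enumeration gives 24/100 with replacement and 24/90 without replacement, in agreement with the closed forms.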

If we now choose $r$ balls without replacement from an urn containing $N$ balls, then
all the possible outcomes are of the form $(z_1, z_2, \ldots, z_r)$, where the $z_i$'s must be
different. On the first draw we have $N$ possible balls, on the second draw we have
$N - 1$ possible balls, etc. Hence, the total number of possible outcomes or number
of $r$-tuples is $N(N - 1) \cdots (N - r + 1)$. We denote this by $(N)_r$. If all the balls are
selected, forming an $N$-tuple, then the number of outcomes is

$$(N)_N = N(N - 1) \cdots 1$$

which is defined as $N!$ and is termed $N$ factorial. As an example, if there are 3
balls labeled A, B, C, then the number of 3-tuples is $3! = 3 \cdot 2 \cdot 1 = 6$. To verify this
we have by enumeration that the possible 3-tuples are (A,B,C), (A,C,B), (B,A,C),
(B,C,A), (C,A,B), (C,B,A). Note that 3! is the number of ways that 3 objects can
be arranged. These arrangements are termed the permutations of the letters A, B,
and C. Note that with the definition of a factorial we have that $(N)_r = N!/(N - r)!$.
Another example follows.

Example  3.11  —  More  urns  -  using  permutations 

Five balls numbered 1, 2, 3, 4, 5 are drawn from an urn without replacement. What
is the probability that they will be drawn in the same order as their number? Each
outcome is represented by the 5-tuple $(z_1, z_2, z_3, z_4, z_5)$. The only outcome in $E$
is $(1, 2, 3, 4, 5)$ so that $N_E = 1$. To find $N_S$ we require the number of ways that
the numbers 1, 2, 3, 4, 5 can be arranged or the number of permutations. This is
$5! = 120$. Hence, the desired probability is $P[E] = 1/120$.

❖ 
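The count in Example 3.11 can be confirmed by listing all permutations explicitly. The MATLAB sketch below uses the built-in perms function; the variable names are illustrative.

```matlab
Pm = perms(1:5);                        % each row is one ordering of the five balls

NS = size(Pm, 1);                       % number of permutations, 5! = 120
NE = sum(ismember(Pm, 1:5, 'rows'));    % number of rows equal to (1,2,3,4,5)

PE = NE/NS;                             % 1/120
```

Listing all permutations is practical only for small N, since the number of rows grows as N!.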

Before  continuing,  we  give  one  more  example  to  explain  our  fixation  with  drawing 
balls  out  of  urns. 

Example  3.12  —  The  birthday  problem 

A  probability  class  has  N  students  enrolled.  What  is  the  probability  that  at  least 
two  of  the  students  will  have  the  same  birthday?  We  first  assume  that  each  student 
in  the  class  is  equally  likely  to  be  born  on  any  day  of  the  year.  To  solve  this 
problem  consider  a  “birthday  urn”  that  contains  365  balls.  Each  ball  is  labeled  with 
a  different  day  of  the  year.  Now  allow  each  student  to  select  a  ball  at  random,  note 
its  date,  and  return  it  to  the  urn.  The  day  of  the  year  on  the  ball  becomes  his/her 
birthday.  The  probability  desired  is  of  the  event  that  two  or  more  students  choose 
the  same  ball.  It  is  more  convenient  to  determine  the  probability  of  the  complement 
event  or  that  no  two  students  have  the  same  birthday.  Then,  using  Property  3.1 

$$P[\text{at least 2 students have same birthday}] = 1 - P[\text{no students have same birthday}].$$

The sample space is composed of $N_S = 365^N$ $N$-tuples (sampling with replacement).
The number of $N$-tuples for which all the outcomes are different is $N_E = (365)_N$.
This is because the event that no two students have the same birthday occurs if
the first student chooses any of the 365 balls, the second student chooses any of the
remaining 364 balls, etc., which is the same as if sampling without replacement were
used. The probability is then

$$P[\text{at least 2 students have same birthday}] = 1 - \frac{(365)_N}{365^N}.$$

This probability is shown in Figure 3.9 as a function of the number of students. It is
seen that if the class has 23 or more students, there is a probability of 0.5 or greater
that two students will have the same birthday.


Figure  3.9:  Probability  of  at  least  two  students  having  the  same  birthday. 


Why  this  doesn’t  appear  to  make  sense. 


This result may seem counterintuitive at first, but this is only because the reader
is misinterpreting the question. Most persons would say that you need about 180
people for a 50% chance of two identical birthdays. In contrast, if the question were
posed as to the probability that at least two persons were born on January 1, then
the event would be that at least two persons choose the ball labeled “January 1” from
the birthday urn. For 23 people this probability is considerably smaller (see Problem
3.38). It is the possibility that the two identical birthdays can occur on any day
of the year (365 possibilities) that leads to the unexpectedly large probability. To
verify this result the MATLAB program given below can be used. When run, the
estimated probability for 10,000 repeated experiments was 0.5072. The reader may
wish to reread Section 2.4 at this point.

% birthday.m
%
clear all
rand('state',0)
BD=[0:365]';
event=zeros(10000,1); % initialize to no successful events
for ntrial=1:10000
    for i=1:23
        x(i,1)=ceil(365*rand(1,1)); % chooses birthdays at random
                                    % (ceil rounds up to nearest integer)
    end
    y=sort(x); % arranges birthdays in ascending order
    z=y(2:23)-y(1:22); % compares successive birthdays to each other
    w=find(z==0); % flags same birthdays
    if length(w)>0
        event(ntrial)=1; % event occurs if one or more birthdays the same
    end
end
prob=sum(event)/10000
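For comparison with the simulated estimate, the exact probability from Example 3.12 can be evaluated directly. The MATLAB sketch below writes the ratio of the falling factorial to the power of 365 as a product of ratios, which is an implementation choice made here to avoid forming very large numbers; the range of class sizes (1 to 50) and the variable names are assumptions.

```matlab
% Exact probability from Example 3.12: 1 - (365)_N / 365^N
% Written as a product of N ratios to avoid forming the huge number 365^N
Pexact = zeros(1, 50);
for N = 1:50
    Pexact(N) = 1 - prod((365 - (0:N-1))/365);
end

P22 = Pexact(22);   % just below 0.5
P23 = Pexact(23);   % approximately 0.5073
```

The value for 23 students is close to the estimate of 0.5072 reported for the simulation, and 23 is the smallest class size for which the probability reaches 0.5.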


We summarize our counting formulas so far. Each outcome of an experiment
produces an $r$-tuple, which can be written as $(z_1, z_2, \ldots, z_r)$. If we are choosing
balls in succession from an urn containing $N$ balls, then with replacement
each $z_i$ can take on one of $N$ possible values. The number of possible $r$-tuples
is then $N^r$. If we sample without replacement, then the number of $r$-tuples is only
$(N)_r = N(N - 1) \cdots (N - r + 1)$. If we sample without replacement and $r = N$
or all the balls are chosen, then the number of $r$-tuples is $N!$. In arriving at these
formulas we have used the $r$-tuple representation in which the ordering is used in
the counting. For example, the 3-tuple (A,B,C) is different from (C,A,B), which is
different from (C,B,A), etc. In fact, there are 3! possible orderings or permutations
of the letters A, B, and C. We are frequently not interested in the ordering but only
in the number of distinct elements. An example might be to determine the number
of possible sum-values that can be made from one penny (p), one nickel (n), and
one dime (d) if two coins are chosen. To determine this we use a tree diagram as
shown in Figure 3.10. Note that since this is essentially sampling without replacement,
we cannot have the outcomes pp, nn, or dd (shown in Figure 3.10 as dashed).
The number of possible outcomes is 3 for the first coin and 2 for the second so
that as usual there are $(3)_2 = 3 \cdot 2 = 6$ outcomes. However, only 3 of these are
distinct or produce different sum-values for the two coins. The outcome (p,n) is
counted the same as (n,p) for example. Hence, the ordering of the outcome does
not matter. Both orderings are treated as the same outcome.

Figure 3.10: Tree diagram enumerating possible outcomes.

To remind us that ordering is immaterial we will replace the 2-tuple description by
the set description (recall that the elements of a set may be arranged in any order
to yield the same set). The outcomes of this experiment are therefore {p,n}, {p,d},
{n,d}. In effect, all permutations are considered as a single combination. Thus, to
find the number of combinations:


$$\text{Number of combinations} \times \text{Number of permutations} = \text{Total number of } r\text{-tuple outcomes}$$

or for this example,

$$\text{Number of combinations} \times 2! = (3)_2$$

which yields

$$\text{Number of combinations} = \frac{(3)_2}{2!} = \frac{3!}{1!\,2!} = 3.$$

The number of combinations is given by the symbol $\binom{3}{2}$ and is said to be “3 things
taken 2 at a time”. Also, $\binom{3}{2}$ is termed the binomial coefficient due to its appearance
in the binomial expansion (see Problem 3.43). In general the number of combinations
of $N$ things taken $k$ at a time, i.e., order does not matter, is

$$\binom{N}{k} = \frac{(N)_k}{k!} = \frac{N!}{(N - k)!\,k!}.$$
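The relation between ordered tuples and combinations can be checked with MATLAB's built-in nchoosek and factorial functions. The sketch below uses the coin example (N = 3, k = 2); the numeric labeling of the coins and the variable names are illustrative assumptions.

```matlab
N = 3;
k = 2;

% (N)_k = N(N-1)...(N-k+1): number of ordered k-tuples without replacement
Nperm = prod(N:-1:N-k+1);

% Each combination is counted k! times among the ordered k-tuples
Ncomb = Nperm/factorial(k);

% Built-in binomial coefficient and the factorial form
Ncomb_builtin = nchoosek(N, k);
Ncomb_fact = factorial(N)/(factorial(N - k)*factorial(k));

% The combinations themselves, one per row
% (1 = penny, 2 = nickel, 3 = dime is an illustrative labeling)
C = nchoosek(1:3, 2);
```

The rows of C are [1 2], [1 3], and [2 3], corresponding to the sets {p,n}, {p,d}, and {n,d}.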


Example  3.13  —  Correct  change 

If a person has a penny, nickel, and dime in his pocket and selects two coins at
random, what is the probability that the sum-value will be 6 cents? The sample
space is now $S = \{\{p, n\}, \{p, d\}, \{n, d\}\}$ and $E = \{\{p, n\}\}$. Thus,

$$P[6 \text{ cents}] = P[\{p, n\}] = \frac{1}{3}.$$

Note that each simple event is of the form $\{\cdot, \cdot\}$. Also, $N_S$ can be found from the
original problem statement as $\binom{3}{2} = 3$.

❖ 


Example  3.14  —  How  probable  is  a  royal  flush? 

A person draws 5 cards from a deck of 52 freshly shuffled cards. What is the
probability that he obtains a royal flush? To obtain a royal flush he must draw an
ace, king, queen, jack, and ten of the same suit in any order. There are 4 possible
suits that will produce the flush. The total number of combinations of cards
or “hands” that can be drawn is $\binom{52}{5}$ and a royal flush will result from 4 of these
combinations. Hence,

$$P[\text{royal flush}] = \frac{4}{\binom{52}{5}} \approx 0.00000154.$$


Ordered  vs.  unordered 


It is sometimes confusing that $\binom{52}{5}$ is used for $N_S$. It might be argued that the
first card can be chosen in 52 ways, the second card in 51 ways, etc. for a total of
$(52)_5$ possible outcomes. Likewise, for a royal flush in hearts we can choose any of
5 cards, followed by any of 4 cards, etc. for a total of 5! possible outcomes. Hence,
the probability of a royal flush in hearts should be

$$P[\text{royal flush in hearts}] = \frac{5!}{(52)_5}.$$

But this is just the same as $1/\binom{52}{5}$, which is the same as obtained by counting
combinations. In essence, we have reduced the sample space by a factor of 5! but
additionally each event is commensurately reduced by 5!, yielding the same probability.
Equivalently, we have grouped together each set of 5! permutations to yield
a single combination.
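The equivalence of the ordered and unordered counts for the royal flush can be verified numerically. The MATLAB sketch below is only a check of the arithmetic; the variable names are illustrative.

```matlab
% Unordered count: 4 royal flushes among nchoosek(52,5) possible hands
NS = nchoosek(52, 5);            % 2598960
Pflush = 4/NS;                   % approximately 1.54e-6

% Ordered count for a royal flush in hearts: 5! favorable 5-tuples out of (52)_5
Nordered = prod(52:-1:48);       % (52)_5 = 52*51*50*49*48
Phearts_ordered   = factorial(5)/Nordered;
Phearts_unordered = 1/NS;
```

The two values for a royal flush in hearts coincide because Nordered is exactly 5! times NS.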


3.9 Binomial Probability Law

In  Chapter  1  we  cited  the  binomial  probability  law  for  the  number  of  heads  obtained 
for  N  tosses  of  a  coin.  The  same  law  also  applies  to  the  problem  of  drawing  balls  from 
an  urn.  First,  however,  we  look  at  a  related  problem  that  is  of  considerable  practical 
interest. Specifically, consider an urn consisting of a proportion $p$ of red balls and the
remaining proportion $1 - p$ of black balls. What is the probability of drawing $k$ red
balls in $M$ drawings without replacement? Note that we can associate the drawing
of a red ball as a “success” and the drawing of a black ball as a “failure”. Hence,
we are equivalently asking for the probability of $k$ successes out of a maximum of
$M$ successes. To determine this probability we first assume that the urn contains
$N$ balls, of which $N_R$ are red and $N_B$ are black. We sample the urn by drawing $M$
balls without replacement. To make the balls distinguishable we label the red balls
as $1, 2, \ldots, N_R$ and the black ones as $N_R + 1, N_R + 2, \ldots, N$. The sample space is

$$S = \{(z_1, z_2, \ldots, z_M) : z_i = 1, \ldots, N \text{ and no two } z_i\text{'s are the same}\}.$$


We assume that the balls are selected at random so that the outcomes are equally
likely. The total number of outcomes is $N_S = (N)_M$. Hence, the probability of
obtaining $k$ red balls is

$$P[k] = \frac{N_E}{N_S}. \tag{3.25}$$

$N_E$ is the number of $M$-tuples that contain $k$ distinct integers in the range from
1 to $N_R$ and $M - k$ distinct integers in the range $N_R + 1$ to $N$. For example, if
$N_R = 3$, $N_B = 4$ (and hence $N = 7$), $M = 4$, and $k = 2$, the red balls are contained
in $\{1, 2, 3\}$, the black balls are contained in $\{4, 5, 6, 7\}$ and we choose 4 balls without
replacement. A successful outcome has two red balls and two black balls. Some
successful outcomes are (1, 4, 2, 5), (1, 4, 5, 2), (1, 2, 4, 5), etc. or (2, 3, 4, 6), (2, 4, 3, 6),
(2, 6, 3, 4), etc. Hence, $N_E$ is the total number of outcomes for which two of the $z_i$'s
are elements of $\{1, 2, 3\}$ and two of the $z_i$'s are elements of $\{4, 5, 6, 7\}$. To determine
this number of successful $M$-tuples we


1. Choose the $k$ positions of the $M$-tuple to place the red balls. (The remaining
positions will be occupied by the black balls.)

2. Place the $N_R$ red balls in the $k$ positions obtained from step 1.

3. Place the $N_B$ black balls in the remaining $M - k$ positions.

Step 1 is accomplished in $\binom{M}{k}$ ways since any permutation of the chosen positions
produces the same set of positions. Step 2 is accomplished in $(N_R)_k$ ways and step
3 is accomplished in $(N_B)_{M-k}$ ways. Thus, we have that

$$N_E = \binom{M}{k} (N_R)_k (N_B)_{M-k} = \frac{M!}{(M-k)!\,k!} (N_R)_k (N_B)_{M-k} \tag{3.26}$$

so that finally we have from (3.25)

$$P[k] = \frac{\binom{M}{k} (N_R)_k (N_B)_{M-k}}{(N)_M} = \frac{\binom{N_R}{k} \binom{N_B}{M-k}}{\binom{N}{M}}. \tag{3.27}$$
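The three-step counting argument can be checked by brute force for the small example in the text (3 red balls, 4 black balls, 4 draws, 2 reds). The following is a minimal Python sketch; the variable names are illustrative assumptions. It enumerates every ordered tuple of distinct balls and compares the count of successful tuples with the product formula.

```python
from itertools import permutations
from math import comb, perm

N_R, N_B, M, k = 3, 4, 4, 2          # values from the example in the text
N = N_R + N_B
red = set(range(1, N_R + 1))         # red balls are labeled 1..N_R

# Enumerate every ordered M-tuple of distinct balls (the sample space S).
outcomes = list(permutations(range(1, N + 1), M))
N_S = len(outcomes)

# Count the tuples that contain exactly k red balls.
N_E = sum(1 for z in outcomes if sum(b in red for b in z) == k)

# Positions times red placements times black placements.
formula = comb(M, k) * perm(N_R, k) * perm(N_B, M - k)
print(N_S, N_E, formula)             # 840 432 432
print(N_E / N_S, comb(N_R, k) * comb(N_B, M - k) / comb(N, M))
```

The last line prints the same probability twice, once from the enumeration and once from the ratio of binomial coefficients.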


This  law  is  called  the  hypergeometric  law  and  describes  the  probability  of  k  successes 
when  sampling  without  replacement  is  used.  If  sampling  with  replacement  is  used, 
then  the  binomial  law  results.  However,  instead  of  repeating  the  entire  derivation 
for  sampling  with  replacement,  we  need  only  assume  that  N  is  large.  Then,  whether 
the  balls  are  replaced  or  not  will  not  affect  the  probability.  To  show  that  this  is 
indeed the case, we start with the expression given by (3.26) and note that for $N$
large and $M \ll N$, then $(N)_M \approx N^M$. Similarly, we assume that $M \ll N_R$ and
$M \ll N_B$ and make similar approximations. As a result we have from (3.25) and
(3.26)

$$P[k] \approx \binom{M}{k} \frac{N_R^k N_B^{M-k}}{N^M} = \binom{M}{k} \left(\frac{N_R}{N}\right)^k \left(\frac{N_B}{N}\right)^{M-k}.$$

Letting $N_R/N = p$ and $N_B/N = (N - N_R)/N = 1 - p$, we have at last the binomial
law

$$P[k] = \binom{M}{k} p^k (1-p)^{M-k}. \tag{3.28}$$
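The large-urn approximation can be illustrated numerically. The sketch below is a minimal Python example under assumed values (10 draws, 3 successes, p = 0.3, and a few urn sizes chosen only for illustration); the function names are hypothetical. It evaluates the without-replacement probability for growing urn sizes and prints it next to the with-replacement value.

```python
from math import comb

def hypergeometric(k, M, N_R, N_B):
    """P[k] when sampling without replacement from N_R red and N_B black balls."""
    return comb(N_R, k) * comb(N_B, M - k) / comb(N_R + N_B, M)

def binomial(k, M, p):
    """P[k] when sampling with replacement and success probability p."""
    return comb(M, k) * p**k * (1 - p)**(M - k)

M, k, p = 10, 3, 0.3                      # illustrative values only
for N in (20, 200, 2000, 20000):
    N_R = round(p * N)                    # number of red balls in the urn
    print(N, hypergeometric(k, M, N_R, N - N_R), binomial(k, M, p))
```

As the urn size grows with the proportion of red balls held fixed, the first printed probability should approach the second.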

To  summarize,  the  binomial  law  not  only  applies  to  the  drawing  of  balls  from  urns 
with  replacement  but  also  applies  to  the  drawing  of  balls  without  replacement  if  the 
number of balls in the urn is large. We next use our results in a quality control
application.

3.10 Real-World Example - Quality Control

A manufacturer of electronic memory chips produces batches of 1000 chips for shipment
to computer companies. To determine if the chips meet specifications the
manufacturer  initially  tests  all  1000  chips  in  each  batch.  As  demand  for  the  chips 
grows,  however,  he  realizes  that  it  is  impossible  to  test  all  the  chips  and  so  proposes 
that  only  a  subset  or  sample  of  the  batch  be  tested.  The  criterion  for  acceptance 
of  the  batch  is  that  at  least  95%  of  the  sample  chips  tested  meet  specifications.  If 
the  criterion  is  met,  then  the  batch  is  accepted  and  shipped.  This  criterion  is  based 
on  past  experience  of  what  the  computer  companies  will  find  acceptable,  i.e.,  if  the 
batch  “yield”  is  less  than  95%  the  computer  companies  will  not  be  happy.  The 
production  manager  proposes  that  a  sample  of  100  chips  from  the  batch  be  tested 
and  if  95  or  more  are  deemed  to  meet  specifications,  then  the  batch  is  judged  to 
be  acceptable.  However,  a  quality  control  supervisor  argues  that  even  if  only  5  of 
the  sample  chips  are  defective,  then  it  is  still  quite  probable  that  the  batch  will  not 
have  a  95%  yield  and  thus  be  defective. 

The  quality  control  supervisor  wishes  to  convince  the  production  manager  that 
a  defective  batch  can  frequently  produce  5  or  fewer  defective  chips  in  a  chip  sample 
of  size  100.  He  does  so  by  determining  the  probability  that  a  defective  batch  will 
have  a  chip  sample  with  5  or  fewer  defective  chips  as  follows.  He  first  needs  to 
assume  the  proportion  of  chips  in  the  defective  batch  that  will  be  good.  Since 
a  good  batch  has  a  proportion  of  good  chips  of  95%,  a  defective  batch  will  have 
a  proportion  of  good  chips  of  less  than  95%.  Since  he  is  quite  conservative,  he 
chooses  this  proportion  as  exactly  p  =  0.94,  although  it  may  actually  be  less.  Then, 
according  to  the  production  manager  a  batch  is  judged  to  be  acceptable  if  the  sample 
produces  95, 96, 97, 98, 99,  or  100  good  chips.  The  quality  control  supervisor  likens 
this problem to the drawing of 100 balls from a “chip urn” containing 1000 balls.
In the urn there are $1000p$ good balls and $1000(1 - p)$ bad ones. The probability of
drawing  95  or  more  good  balls  from  the  urn  is  given  approximately  by  the  binomial 
probability  law.  We  have  assumed  that  the  true  law,  which  is  hypergeometric  due 
to  the  use  of  sampling  without  replacement,  can  be  approximated  by  the  binomial 
law,  which  assumes  sampling  with  replacement.  See  Problem  3.48  for  the  accuracy 
of  this  approximation. 

Now  the  defective  batch  will  be  judged  as  acceptable  if  there  are  95  or  more 
successes  out  of  a  possible  100  draws.  The  probability  of  this  occurring  is 

$$P[k \geq 95] = \sum_{k=95}^{100} \binom{100}{k} p^k (1-p)^{100-k}$$

where $p = 0.94$. The probability $P[k \geq 95]$ versus $p$ is plotted in Figure 3.11.
For  p  =  0.94  we  see  that  the  defective  batch  will  be  accepted  with  a  probability 
of  about  0.45  or  almost  half  of  the  defective  batches  will  be  shipped.  The  quality 
control supervisor is indeed correct. The production manager does not believe the
result since it appears to be too high. Using sampling with replacement, which
will produce results in accordance with the binomial law, he performs a computer
simulation (see Problem 3.49). Based on the simulated results he reluctantly accepts
the supervisor’s conclusions. In order to reduce this probability the quality control
supervisor suggests changing the acceptance strategy to one in which the batch
is accepted only if 98 or more of the samples meet the specifications. Now the
probability that the defective batch will be judged as acceptable is

$$P[k \geq 98] = \sum_{k=98}^{100} \binom{100}{k} p^k (1-p)^{100-k}$$

where $p = 0.94$, the assumed proportion of good chips in the defective batch. This
produces the results shown in Figure 3.12. The acceptance probability for a defective
batch is now reduced to only about 0.05.

Figure 3.11: Probability of accepting a defective batch versus proportion of good
chips in the defective batch - accept if 5 or fewer bad chips in a sample of 100.
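The two acceptance probabilities can be evaluated directly from the binomial sums, and checked by simulation with replacement. The following is a minimal Python sketch; the function names, the number of trials, and the random seed are assumptions made here for illustration and are not taken from the text.

```python
import random
from math import comb

def accept_prob(p, min_good, n=100):
    """Binomial probability of at least min_good good chips in n draws."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(min_good, n + 1))

def simulate(p, min_good, n=100, trials=10000, seed=0):
    """Monte Carlo estimate of the same probability (sampling with replacement)."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        good = sum(rng.random() < p for _ in range(n))   # good chips in this sample
        accepted += good >= min_good
    return accepted / trials

print(accept_prob(0.94, 95))   # accept if 95 or more good: about 0.44
print(accept_prob(0.94, 98))   # accept if 98 or more good: about 0.06
print(simulate(0.94, 95))      # simulated estimate for the first strategy
```

Calling `accept_prob` over a range of p values would trace out curves of the kind described for Figures 3.11 and 3.12.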

There  is  a  price  to  be  paid,  however,  for  only  accepting  a  batch  if  98  or  more  of 
the  samples  are  good.  Many  more  good  batches  will  be  rejected  than  if  the  previous 
strategy  were  used  (see  Problem  3.50).  This  is  deemed  to  be  a  reasonable  tradeoff. 
Note  that  the  supervisor  may  well  be  advised  to  examine  his  initial  assumption 
about  p  for  the  defective  batch.  If,  for  instance,  he  assumed  that  a  defective  batch 
could be characterized by $p = 0.9$, then according to Figure 3.11, the production
manager’s original strategy would produce a probability of less than 0.1 of accepting
a defective batch.

Figure 3.12: Probability of accepting a defective batch versus proportion of good
chips in the defective batch - accept if 2 or fewer bad chips in a sample of 100.

Problems

3.1 (^) (w) The universal set is given by $S = \{x : -\infty < x < \infty\}$ (the real line).
If $A = \{x : x > 1\}$ and $B = \{x : x < 2\}$, find the following:

a. $A^c$ and $B^c$

b. $A \cup B$ and $A \cap B$

c. $A - B$ and $B - A$

3.2  (w)  Repeat  Problem  3.1  if  S  =  {x  :  x  >  0}. 

3.3  (w)  A  group  of  voters  go  to  the  polling  place.  Their  names  and  ages  are  Lisa, 

21,  John,  42,  Ashley,  18,  Susan,  64,  Phillip,  58,  Fred,  48,  and  Brad,  26.  Find 
the following sets:

a. Voters older than 30

b.  Voters  younger  than  30 

c.  Male  voters  older  than  30 

d.  Female  voters  younger  than  30 

e.  Voters  that  are  male  or  younger  than  30 

f.  Voters  that  are  female  and  older  than  30 

Next  find  any  two  sets  that  partition  the  universe. 

3.4 (w) Given the sets $A_i = \{x : 0 < x < i\}$ for $i = 1, 2, \ldots, N$, find $\bigcup_{i=1}^{N} A_i$ and
$\bigcap_{i=1}^{N} A_i$. Are the $A_i$'s disjoint?

3.5 (w) Prove that the sets $A = \{x : x > -1\}$ and $B = \{x : 2x + 2 > 0\}$ are equal.

3.6 (t) Prove that if $x \in A \cap B^c$, then $x \in A - B$.

3.7 (o) (w) If $S = \{1, 2, 3, 4, 5, 6\}$, find sets $A$ and $B$ that are disjoint. Next find

sets  C  and  D  that  partition  the  universe. 

3.8  (w)  If  S  =  {(x,y)  :  0  <  x  <  1  and  0  <  y  <  1},  find  sets  A  and  B  that  are 

disjoint.  Next  find  sets  C  and  D  that  partition  the  universe. 

3.9  (t)  In  this  problem  we  see  how  to  construct  disjoint  sets  from  ones  that  are  not 

disjoint  so  that  their  unions  will  be  the  same.  We  consider  only  three  sets  and 
ask the reader to generalize the result. Calling the nondisjoint sets $A$, $B$, $C$
and the union $D = A \cup B \cup C$, we wish to find three disjoint sets $E_1$, $E_2$, and
$E_3$ so that $D = E_1 \cup E_2 \cup E_3$. To do so let

$$E_1 = A$$
$$E_2 = B - E_1$$
$$E_3 = C - (E_1 \cup E_2).$$

Using a Venn diagram explain this procedure. If we now have sets $A_1, A_2, \ldots, A_N$,
explain how to construct $N$ disjoint sets with the same union.

3.10 (o)(f) Replace the set expression $A \cup B \cup C$ with one using intersections and
complements. Replace the set expression $A \cap B \cap C$ with one using unions and
complements.

3.11 (w) The sets $A$, $B$, $C$ are subsets of $S = \{(x, y) : 0 < x < 1 \text{ and } 0 < y < 1\}$.
They are defined as

$$A = \{(x, y) : x < 1/2,\ 0 < y < 1\}$$
$$B = \{(x, y) : x > 1/2,\ 0 < y < 1\}$$
$$C = \{(x, y) : 0 < x < 1,\ y < 1/2\}.$$

Explicitly determine the set $A \cup (B \cap C)^c$ by drawing a picture of it as well as
pictures of all the individual sets. For simplicity you can ignore the edges of
the sets in drawing any diagrams. Can you represent the resultant set using
only unions and complements?

3.12 (^) (w) Give the size of each set and also whether it is discrete or continuous.
If the set is infinite, determine if it is countably infinite or not.

a. $A = \{\text{seven-digit numbers}\}$

b. $B = \{x : 2x = 1\}$

c. $C = \{x : 0 < x < 1 \text{ and } 1/2 < x < 2\}$

d. $D = \{(x, y) : x^2 + y^2 = 1\}$

e. $E = \{x : x^2 + 3x + 2 = 0\}$

f. $F = \{\text{positive even integers}\}$

3.13 (w) Two dice are tossed and the number of dots on each side that come up
are added together. Determine the sample space, outcomes, impossible event,
three different events including a simple event, and two mutually exclusive
events. Use appropriate set notation.

3.14 (o) (w) The temperature in Rhode Island on a given day in August is found
to always be in the range from 30° F to 100° F. Determine the sample space,
outcomes, impossible event, three different events including a simple event,
and two mutually exclusive events. Use appropriate set notation.

3.15 (t) Prove that if the sample space has size $N$, then the total number of events
(including the impossible event and the certain event) is $2^N$. Hint: There are
$\binom{N}{k}$ ways to choose an event with $k$ outcomes from a total of $N$ outcomes.
Also, use the binomial formula

$$(a + b)^N = \sum_{k=0}^{N} \binom{N}{k} a^k b^{N-k}$$

which was proven in Problem 1.11.

3.16  (w)  An  urn  contains  2  red  balls  and  3  black  balls.  The  red  balls  are  labeled 
with  the  numbers  1  and  2  and  the  black  balls  are  labeled  as  3,  4,  and  5.  Three 
balls  are  drawn  without  replacement.  Consider  the  events  that 

$A = \{\text{a majority of the balls drawn are black}\}$

$B = \{\text{the sum of the numbers of the balls drawn} > 10\}.$

Are these events mutually exclusive? Explain your answer.

3.17 (t) Prove Axiom 3' by using mathematical induction (see Appendix B) and
Axiom  3. 

3.18  (^)  (w)  A  roulette  wheel  has  numbers  1  to  36  equally  spaced  around  its 
perimeter.  The  odd  numbers  are  colored  red  while  the  even  numbers  are 
colored  black.  If  a  spun  ball  is  equally  likely  to  yield  any  of  the  36  numbers, 
what  is  the  probability  of  a  black  number,  of  a  red  number?  What  is  the 
probability  of  a  black  number  that  is  greater  than  24?  What  is  the  probability 
of  a  black  number  or  a  number  greater  than  24? 

3.19  (0)(c)  Use  a  computer  simulation  to  simulate  the  tossing  of  a  fair  die.  Based 
on  the  simulation  what  is  the  probability  of  obtaining  an  even  number?  Does 
it  agree  with  the  theoretical  result?  Hint:  See  Section  2.4. 

3.20  (w)  A  fair  die  is  tossed.  What  is  the  probability  of  obtaining  an  even  number, 
an  odd  number,  a  number  that  is  even  or  odd,  a  number  that  is  even  and  odd? 

3.21  (o)  (w)  A  die  is  tossed  that  yields  an  even  number  with  twice  the  probability 
of  yielding  an  odd  number.  What  is  the  probability  of  obtaining  an  even 
number,  an  odd  number,  a  number  that  is  even  or  odd,  a  number  that  is  even 
and  odd? 


3.22 (w) If a single letter is selected at random from $\{A, B, C\}$, find the probability
of all events. Recall that the total number of events is $2^N$, where $N$ is the
number  of  simple  events.  Do  these  probabilities  sum  to  one?  If  not,  why  not? 
Hint:  See  Problem  3.15. 


3.23  (^)  (w)  A  number  is  chosen  from  {1, 2, 3, . . .}  with  probability 


m  =  < 


i  =  1 

i  =  2 
i  >  3 


Find  P[i  >  4]. 

3.24 (f) For a sample space $S = \{0, 1, 2, \ldots\}$ the probability assignment

$$P[i] = \exp(-2) \frac{2^i}{i!}$$

is proposed. Is this a valid assignment?

3.25  (o)  (w)  Two  fair  dice  are  tossed.  Find  the  probability  that  only  one  die 
comes up a 6.

3.26 (w) A circuit consists of $N$ switches in parallel (see Example 3.6 for $N = 2$).
The sample space can be summarized as $S = \{(z_1, z_2, \ldots, z_N) : z_i = s \text{ or } f\}$,
where  s  indicates  a  success  or  the  switch  closes  and  f  indicates  a  failure  or 
the  switch  fails  to  close.  Assuming  that  all  the  simple  events  are  equally 
likely,  what  is  the  probability  that  a  circuit  is  closed  when  all  the  switches  are 
activated  to  close?  Hint:  Consider  the  complement  event. 

3.27 (^) (w) Can the series circuit of Figure 3.7 ever outperform the parallel circuit
of Figure 3.6 in terms of having a higher probability of closing when both
switches are activated to close? Assume that switch 1 closes with probability
$p$, switch 2 closes with probability $p$, and both switches close with probability
$p^2$.

3.28 (w) Verify the formula (3.20) for $P[E_1 \cup E_2 \cup E_3]$ if $E_1$, $E_2$, $E_3$ are events that
are not necessarily mutually exclusive. To do so use a Venn diagram.

3.29 (t) Prove that

$$P[E_1 E_2] + P[E_1 E_3] + P[E_2 E_3] \geq P[E_1 E_2 E_3].$$

3.30  (w)  A  person  always  arrives  at  his  job  between  8:00  AM  and  8:20  AM.  He  is 
equally  likely  to  arrive  anytime  within  that  period.  What  is  the  probability 
that  he  will  arrive  at  8:10  AM?  What  is  the  probability  that  he  will  arrive 
between  8:05  and  8:10  AM? 

3.31  (w)  A  random  number  generator  produces  a  number  that  is  equally  likely  to 
be  anywhere  in  the  interval  (0, 1).  What  are  the  simple  events?  Can  you  use 
(3.10)  to  find  the  probability  that  a  generated  number  will  be  less  than  1/2? 
Explain. 

3.32  (w)  If  two  fair  dice  are  tossed,  find  the  probability  that  the  same  number  will 
be  observed  on  each  one.  Next,  find  the  probability  that  different  numbers 
will  be  observed. 

3.33  (^)  (w)  Three  fair  dice  are  tossed.  Find  the  probability  that  2  of  the  numbers 
will  be  the  same  and  the  third  will  be  different. 

3.34  (w,c)  An  urn  contains  4  red  balls  and  2  black  balls.  Two  balls  are  chosen  at 
random  and  without  replacement.  What  is  the  probability  of  obtaining  one 
red  ball  and  one  black  ball  in  any  order?  Verify  your  results  by  enumerating 
all  possibilities  using  a  computer  evaluation. 

3.35  (o)  (f)  Rhode  Island  license  plate  numbers  are  of  the  form  GR315  (2  letters 
followed by 3 digits). How many different license plates can be issued?

3.36 (f) A baby is to be named using four letters of the alphabet. The letters can
be  used  as  often  as  desired.  How  many  different  names  are  there?  (Of  course, 
some  of  the  names  may  not  be  pronounceable). 

3.37 (c) It is difficult to compute $N!$ when $N$ is large. As an approximation, we
can use Stirling’s formula, which says that for large $N$

$$N! \approx \sqrt{2\pi}\, N^{N+1/2} \exp(-N).$$

Compare Stirling’s approximation to the true value of $N!$ for $N = 1, 2, \ldots, 100$
using a digital computer. Next try calculating the exact value of $N!$ for $N =
200$ using a computer. Hint: Try printing out the logarithm of $N!$ and compare
it to the logarithm of its approximation.

3.38  (^)  (t)  Determine  the  probability  that  in  a  class  of  23  students  two  or  more 
students  have  birthdays  on  January  1. 

3.39  (c)  Use  a  computer  simulation  to  verify  your  result  in  Problem  3.38. 

3.40  (^)  (w)  A  pizza  can  be  ordered  with  up  to  four  different  toppings.  Find  the 
total  number  of  different  pizzas  (including  no  toppings)  that  can  be  ordered. 
Next,  if  a  person  wishes  to  pay  for  only  two  toppings,  how  many  two-topping 
pizzas  can  he  order? 

3.41 (f) How many subsets of size three can be made from $\{A, B, C, D, E\}$?

3.42 (w) List all the combinations of two coins that can be chosen from the following
coins: one penny (p), one nickel (n), one dime (d), one quarter (q). What
are the possible sum-values?

3.43 (f) The binomial theorem states that

$$(a + b)^N = \sum_{k=0}^{N} \binom{N}{k} a^k b^{N-k}.$$

Expand $(a + b)^3$ and $(a + b)^4$ into powers of $a$ and $b$ and compare your results
to the formula.


3.44  (o)  (w)  A  deck  of  poker  cards  contains  an  ace,  king,  queen,  jack,  10,  9,  8, 
7,  6,  5,  4,  3,  2  in  each  of  the  four  suits,  hearts  (h),  clubs  (c),  diamonds  (d), 
and  spades  (s),  for  a  total  of  52  cards.  If  5  cards  are  chosen  at  random  from 
a  deck,  find  the  probability  of  obtaining  4  of  a  kind,  as  for  example,  8-h,  8-c, 
8-d,  8-s,  9-c.  Next  find  the  probability  of  a  flush,  which  occurs  when  all  five 
cards have the same suit, as for example, 8-s, queen-s, 2-s, ace-s, 5-s.

3.45 (w) A class consists of 30 students, of which 20 are freshmen and 10 are
sophomores.  If  5  students  are  selected  at  random,  what  is  the  probability  that 
they  will  all  be  sophomores? 

3.46  (w)  An  urn  containing  an  infinite  number  of  balls  has  a  proportion  p  of  red 
balls, and the remaining portion $1 - p$ of black balls. Two balls are chosen at
random.  What  value  of  p  will  yield  the  highest  probability  of  obtaining  one 
red  ball  and  one  black  ball  in  any  order? 

3.47  (w)  An  urn  contains  an  infinite  number  of  coins  that  are  either  two-headed  or 
two-tailed.  The  proportion  of  each  kind  is  the  same.  If  we  choose  M  coins  at 
random,  explain  why  the  probability  of  obtaining  k  heads  is  given  by  (3.28) 
with $p = 1/2$. Also, how does this experiment compare to the tossing of a fair
coin  M  times? 


3.48 (c) Compare the hypergeometric law to the binomial law if $N = 1000$, $M =
100$, $p = 0.94$ by calculating the probability $P[k]$ for $k = 95, 96, \ldots, 100$.
Hint: To avoid computational difficulties of calculating $N!$ for large $N$, use
the following strategy to find $x = 1000!/900!$ as an example.

$$y = \ln(x) = \ln(1000!) - \ln(900!) = \sum_{i=1}^{1000} \ln(i) - \sum_{i=1}^{900} \ln(i)$$

and then $x = \exp(y)$. Alternatively, for this example you can cancel out the
common factors in the quotient of $x$ and write it as $x = (1000)_{100}$, which is
easier to compute. But in general, this may be more difficult to set up and
program.
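The logarithm strategy in the hint can be sketched as follows. This is a minimal Python illustration; the helper name `log_factorial` is hypothetical. Because Python integers do not overflow, the exact quotient is also computed here purely as a check; that check would not be available in fixed-precision arithmetic.

```python
from math import exp, factorial, fsum, log

def log_factorial(n):
    """ln(n!) computed as a sum of logarithms, avoiding n! itself."""
    return fsum(log(i) for i in range(1, n + 1))

y = log_factorial(1000) - log_factorial(900)   # y = ln(1000!/900!)
x = exp(y)                                     # recover x from its logarithm

# Exact integer quotient, used only to verify the floating-point result.
exact = factorial(1000) // factorial(900)
print(y, log(exact))
print(x / exact)   # ratio should be very close to 1
```

The same helper applies to each factorial appearing in a binomial coefficient, so a ratio of very large factorials can be formed in the log domain and exponentiated only at the end.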

3.49  (o)  (c)  A  defective  batch  of  1000  chips  contains  940  good  chips  and  60  bad 
chips.  If  we  choose  a  sample  of  100  chips,  find  the  probability  that  there  will 
be 95 or more good chips by using a computer simulation. To simplify the
problem  assume  sampling  with  replacement  for  the  computer  simulation  and 
the  theoretical  probability.  Compare  your  result  to  the  theoretical  prediction 
in  Section  3.10. 


3.50 (c) For the real-world problem discussed in Section 3.10 use a computer simulation
to determine the probability of rejecting a good batch. To simplify your
code  assume  sampling  with  replacement.  A  good  batch  is  defined  as  one  with 
a  probability  of  obtaining  a  good  chip  of  p  =  0.95.  The  two  strategies  are  to 
accept  the  batch  if  95  or  more  of  the  100  samples  are  good  and  if  98  or  more 
of  the  100  samples  are  good.  Explain  your  results.  Can  you  use  Figures  3.11 
and  3.12  to  determine  the  theoretical  probabilities? 


Chapter  4 


Conditional  Probability 


4.1  Introduction 

In  the  previous  chapter  we  determined  the  probabilities  for  some  simple  experiments. 
An  example  was  the  die  toss  that  produced  a  number  from  1  to  6  “at  random”. 
Hence, a probability of 1/6 was assigned to each possible outcome. In many real-world
“experiments”, the outcomes are not completely random since we have some
prior  knowledge.  For  instance,  knowing  that  it  has  rained  the  previous  2  days  might 
influence  our  assignment  of  the  probability  of  sunshine  for  the  following  day.  Another 
example  is  to  determine  the  probability  that  an  individual  chosen  from  some  general 
population  weighs  more  than  200  lbs.,  knowing  that  his  height  exceeds  6  ft.  This 
motivates  our  interest  in  how  to  determine  the  probability  of  an  event,  given  that  we 
have  some  prior  knowledge.  For  the  die  tossing  experiment  we  might  inquire  as  to  the 
probability  of  obtaining  a  4,  if  it  is  known  that  the  outcome  is  an  even  number.  The 
additional  knowledge  should  undoubtedly  change  our  probability  assignments.  For 
example,  if  it  is  known  that  the  outcome  is  an  even  number,  then  the  probability 
of  any  odd-numbered  outcome  must  be  zero.  It  is  this  interaction  between  the 
original  probabilities  and  the  probabilities  in  light  of  prior  knowledge  that  we  wish 
to  describe  and  quantify,  leading  to  the  concept  of  a  conditional  probability. 

4.2  Summary 

Section 4.3 motivates and then defines the conditional probability as (4.1). In doing
so the concept of a joint event and its probability are introduced as well as
the marginal probability of (4.3). Conditional probabilities can be greater than,
less than, or equal to the ordinary probability as illustrated in Figure 4.2. Also,
conditional probabilities are true probabilities in that they satisfy the basic axioms
and so can be manipulated in the usual ways. Using the law of total probability
(4.4), the probabilities for compound experiments are easily determined. When the
conditional probability is equal to the ordinary probability, the events are said to
be statistically independent. Then, knowledge of the occurrence of one event does
not change the probability of the other event. The condition for two events to
be independent is given by (4.5). Three events are statistically independent if the
conditions (4.6)-(4.9) hold. Bayes’ theorem is defined by either (4.13) or (4.14).
Embodied in the theorem are the concepts of a prior probability (before the experiment
is conducted) and a posterior probability (after the experiment is conducted).
Conclusions may be drawn based on the outcome of an experiment as to whether
certain hypotheses are true. When an experiment is repeated multiple times and
the experiments are independent, the probability of a joint event is easily found
via (4.15). Some probability laws that result from the independent multiple experiment
assumption are the binomial (4.16), the geometric (4.17), and the multinomial
(4.19). For dependent multiple experiments (4.20) must be used to determine probabilities
of joint events. If, however, the experimental outcome probabilities only
depend on the previous experimental outcome, then the Markov condition is satisfied.
This results in the simpler formula for determining joint probabilities given by
(4.21). Also, this assumption leads to the concept of a Markov chain, an example of
which is shown in Figure 4.8. Finally, in Section 4.7 an example of the use of Bayes’
theorem to detect the presence of a cluster is investigated.

4.3  Joint  Events  and  the  Conditional  Probability 

In  formulating  a  useful  theory  of  conditional  probability  we  are  led  to  consider 
two  events.  Event  A  is  our  event  of  interest  while  event  B  represents  the  event 
that  embodies  our  prior  knowledge.  For  the  fair  die  toss  example  described  in  the 
introduction,  the  event  of  interest  is  A  =  {4}  and  the  event  describing  our  prior 
knowledge  is  an  even  outcome  or  B  =  {2,4,6}.  Note  that  when  we  say  that  the 
outcome  must  be  even,  we  do  not  elaborate  on  why  this  is  the  case.  It  may  be 
because  someone  has  observed  the  outcome  of  the  experiment  and  conveyed  this 
partial  information  to  us.  Alternatively,  it  may  be  that  the  experimenter  loathes 
odd  outcomes,  and  therefore  keeps  tossing  the  die  until  an  even  outcome  is  obtained. 
Conditional  probability  does  not  address  the  reasons  for  the  prior  information,  only 
how  to  accommodate  it  into  a  probabilistic  framework.  Continuing  with  the  fair 
die  example,  a  typical  sequence  of  outcomes  for  a  repeated  experiment  is  shown  in 
Figure  4.1.  The  odd  outcomes  are  shown  as  dashed  lines  and  are  to  be  ignored. 
From  the  figure  we  see  that  the  probability  of  a  4  is  about  9/25  =  0.36,  or  about 
1/3,  using  a  relative  frequency  interpretation  of  probability.  This  has  been  found 
by  taking  the  total  number  of  4’s  and  dividing  by  the  total  number  of  2’s,  4’s,  and 
6’s.  Specifically,  we  have  that 

$$\frac{N_A}{N_B} = \frac{9}{25}.$$
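The relative frequency calculation can be repeated with a much longer simulated sequence of tosses. The following is a minimal Python sketch; the number of tosses and the random seed are arbitrary choices made here for illustration. Odd outcomes are ignored, and the number of 4's is divided by the number of even outcomes.

```python
import random

rng = random.Random(1)                        # fixed seed for repeatability
tosses = [rng.randint(1, 6) for _ in range(100000)]

N_B = sum(1 for t in tosses if t % 2 == 0)    # tosses with an even outcome
N_A = sum(1 for t in tosses if t == 4)        # tosses that produced a 4

print(N_A / N_B)                              # relative frequency, near 1/3
```

With many more tosses than the sequence shown in the figure, the ratio should settle closer to 1/3 than the 0.36 obtained from 25 even outcomes.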

Figure 4.1: Outcomes for repeated tossing of a fair die.

Another problem might be to determine the probability of $A = \{1, 4\}$, knowing
that the outcome is even. In this case, we should use $N_{A \cap B}/N_B$ to make sure we
only count the outcomes that can occur in light of our knowledge of $B$. For this
example, only the 4 in $\{1, 4\}$ could have occurred. If an outcome is not in $B$, then
that outcome will not be included in $A \cap B$ and will not be counted in $N_{A \cap B}$. Now
letting $S = \{1, 2, 3, 4, 5, 6\}$ be the sample space and $N_S$ its size, the probability of
$A$ given $B$ is

$$\frac{N_{A \cap B}}{N_B} = \frac{N_{A \cap B}/N_S}{N_B/N_S} = \frac{P[A \cap B]}{P[B]}.$$

This is termed the conditional probability and is denoted by $P[A|B]$ so that we have
as our definition

$$P[A|B] = \frac{P[A \cap B]}{P[B]}. \tag{4.1}$$

Note that to determine it, we require $P[A \cap B]$ which is the probability of both $A$
and $B$ occurring or the probability of the intersection. Intuitively, the conditional
probability is the proportion of time $A$ and $B$ occur divided by the proportion of
time  that  B  occurs.  The  event  B  =  {2, 4, 6}  comprises  a  new  sample  space  and  is 
sometimes  called  the  reduced  sample  space.  The  denominator  term  in  (4.1)  serves  to 
normalize  the  conditional  probabilities  so  that  the  probability  of  the  reduced  sample 
space  is  one  (set  A  =  B  in  (4.1)).  Returning  to  the  die  toss,  the  probability  of  a  4, 
given  that  the  outcome  is  even,  is  found  as 


AC\B  =  {4}  fl  {2, 4, 6}  =  {4}  =  A 
B  =  {2,4,6} 
                     W1        W2        W3        W4        W5
Weight (lbs.)      100-130   130-160   160-190   190-220   220-250   P[H_i]

H1  5'-5'4"         0.08      0.04      0.02      0         0         0.14
H2  5'4"-5'8"       0.06      0.12      0.06      0.02      0         0.26
H3  5'8"-6'         0         0.06      0.14      0.06      0         0.26
H4  6'-6'4"         0         0.02      0.06      0.10      0.04      0.22
H5  6'4"-6'8"       0         0         0         0.08      0.04      0.12

Table 4.1: Joint probabilities for heights and weights of college students.


and therefore

\[ P[A|B] = \frac{P[A \cap B]}{P[B]} = \frac{P[A]}{P[B]} = \frac{1/6}{3/6} = \frac{1}{3} \]

as expected. Note that P[A ∩ B] and P[B] are computed based on the original
sample space, S.
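
The die-toss result can be checked numerically. The following Python sketch is not part of the original text; the function names, the seed, and the number of simulated tosses are illustrative assumptions. It computes P[A|B] exactly from definition (4.1) for equally likely outcomes, and also estimates it as the relative frequency N_{A∩B}/N_B.

```python
from fractions import Fraction
import random


def conditional_probability(event_a, event_b, sample_space):
    """Exact P[A|B] = P[A and B] / P[B], assuming equally likely outcomes."""
    space = set(sample_space)
    a, b = set(event_a) & space, set(event_b) & space
    p_b = Fraction(len(b), len(space))
    if p_b == 0:
        raise ValueError("P[B] must be nonzero")
    p_ab = Fraction(len(a & b), len(space))
    return p_ab / p_b


def relative_frequency(event_a, event_b, num_tosses, seed=0):
    """Estimate P[A|B] as N_{A and B} / N_B from simulated fair-die tosses."""
    rng = random.Random(seed)
    n_ab = n_b = 0
    for _ in range(num_tosses):
        outcome = rng.randint(1, 6)  # inclusive on both ends
        if outcome in event_b:
            n_b += 1
            if outcome in event_a:
                n_ab += 1
    return n_ab / n_b


S = {1, 2, 3, 4, 5, 6}
exact = conditional_probability({4}, {2, 4, 6}, S)
estimate = relative_frequency({4}, {2, 4, 6}, 100_000)
print(exact, round(estimate, 3))
```

Both P[A ∩ B] and P[B] are computed on the original sample space S, as noted above; the simulation only discards tosses that fall outside B.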

The event A ∩ B is usually called the joint event since both events must occur
for a nonempty intersection. Likewise, P[A ∩ B] is termed the joint probability, but
of course, it is nothing more than the probability of an intersection. Also, P[A]
is called the marginal probability to distinguish it from the joint and conditional
probabilities. The reason for this terminology will be discussed shortly.

In defining the conditional probability of (4.1) it is assumed that P[B] ≠ 0. Otherwise,
theoretically and practically, the definition would not make sense. Another
example follows.

Example  4.1  -  Heights  and  weights  of  college  students 

A population of college students has heights H and weights W, which are grouped
into ranges as shown in Table 4.1. The table gives the joint probability of a student
having a given height and weight, which can be denoted as P[H_i ∩ W_j]. For example,
if a student is selected, the probability of his/her height being between 5'4" and 5'8"
and also his/her weight being between 130 lbs. and 160 lbs. is 0.12. Now consider the
event that the student has a weight in the range 130-160 lbs. Calling this event A,
we next determine its probability. Since A = {(H, W) : H = H_1, ..., H_5; W = W_2},
it is explicitly

\[ A = \{(H_1, W_2), (H_2, W_2), (H_3, W_2), (H_4, W_2), (H_5, W_2)\} \]

and since the simple events are by definition mutually exclusive, we have by Axiom
3' (see Section 3.4)

\[ P[A] = \sum_{i=1}^{5} P[(H_i, W_2)] = 0.04 + 0.12 + 0.06 + 0.02 + 0 = 0.24. \]


Next we determine the probability that a student’s weight is in the range of 130-160
lbs., given that the student has height less than 6'. The event of interest A is the
same as before. The conditioning event is B = {(H, W) : H = H_1, H_2, H_3; W =
W_1, ..., W_5} so that A ∩ B = {(H_1, W_2), (H_2, W_2), (H_3, W_2)} and

\[ P[A|B] = \frac{P[A \cap B]}{P[B]} = \frac{0.04 + 0.12 + 0.06}{0.14 + 0.26 + 0.26} = 0.33. \]


We see that it is more probable that the student has weight between 130 and 160
lbs. if it is known beforehand that his/her height is less than 6'. Note that in finding
P[B] we have used

\[ P[B] = \sum_{i=1}^{3} \sum_{j=1}^{5} P[(H_i, W_j)] \qquad (4.2) \]

which is determined by first summing along each row to produce the entries shown
in Table 4.1 as P[H_i]. These are given by

\[ P[H_i] = \sum_{j=1}^{5} P[(H_i, W_j)] \qquad (4.3) \]

and then summing the P[H_i]’s for i = 1,2,3. Hence, we could have written (4.2)
equivalently as

\[ P[B] = \sum_{i=1}^{3} P[H_i]. \]

The probabilities P[H_i] are called the marginal probabilities since they are written
in the margin of the table. If we were to sum along the columns, then we would
obtain the marginal probabilities for the weights or P[W_j]. These are given by

\[ P[W_j] = \sum_{i=1}^{5} P[(H_i, W_j)]. \]

It is important to observe that by utilizing the information that the student’s
height is less than 6', the probability of the event has changed; in this case, it
has increased from 0.24 to 0.33. It is also possible that the opposite may occur.
If we were to determine the probability that the student’s weight is in the range
130-160 lbs., given that he/she has a height greater than 6', then defining the
conditioning event as B = {(H, W) : H = H_4, H_5; W = W_1, ..., W_5} and noting
that A ∩ B = {(H_4, W_2), (H_5, W_2)} we have

\[ P[A|B] = \frac{0.02 + 0}{0.22 + 0.12} \approx 0.059. \]

Hence, the conditional probability has now decreased with respect to the unconditional
probability or P[A].

❖
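
The calculations of Example 4.1 can be reproduced directly from the joint probabilities of Table 4.1. The sketch below is illustrative only; the list layout and the helper names (`height_marginals`, `weight_marginals`, `prob_weight_given_heights`) are assumptions, not notation from the text. Row sums give the marginals P[H_i], column sums give P[W_j], and the conditional probability is the ratio of a partial column sum to the sum of the conditioning rows.

```python
# Joint probabilities P[(H_i, W_j)] from Table 4.1.
# Rows are H1..H5 (heights), columns are W1..W5 (weights).
JOINT = [
    [0.08, 0.04, 0.02, 0.00, 0.00],
    [0.06, 0.12, 0.06, 0.02, 0.00],
    [0.00, 0.06, 0.14, 0.06, 0.00],
    [0.00, 0.02, 0.06, 0.10, 0.04],
    [0.00, 0.00, 0.00, 0.08, 0.04],
]


def height_marginals(joint):
    """P[H_i]: sum each row over the weights."""
    return [sum(row) for row in joint]


def weight_marginals(joint):
    """P[W_j]: sum each column over the heights."""
    return [sum(col) for col in zip(*joint)]


def prob_weight_given_heights(joint, weight_idx, height_idxs):
    """P[W_j | H in the given rows] = P[A and B] / P[B]."""
    p_joint = sum(joint[i][weight_idx] for i in height_idxs)
    p_cond = sum(sum(joint[i]) for i in height_idxs)
    return p_joint / p_cond


p_a = weight_marginals(JOINT)[1]                             # P[A], weight 130-160 lbs.
p_a_short = prob_weight_given_heights(JOINT, 1, [0, 1, 2])   # given height below 6'
p_a_tall = prob_weight_given_heights(JOINT, 1, [3, 4])       # given height above 6'
print(round(p_a, 3), round(p_a_short, 3), round(p_a_tall, 3))
```

The three printed values correspond to the unconditional probability and the two conditional probabilities discussed in the example.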


In general we may have

\[ P[A|B] > P[A] \]
\[ P[A|B] < P[A] \]
\[ P[A|B] = P[A]. \]

See Figure 4.2 for another example.

Figure 4.2: Illustration of possible relationships of conditional probability to ordinary
probability. (a) 2/3 = P[A|B] > P[A] = 1/2. (b) 1/3 = P[A|B] < P[A] = 1/2.
(c) 1/2 = P[A|B] = P[A] = 1/2.

The last possibility is of particular interest since
it states that the probability of an event A is the same whether or not we know that
B has occurred. In this case, the event A is said to be statistically independent of
the event B. In the next section, we will explore this further.

Before proceeding, we wish to emphasize that a conditional probability is a true
probability in that it satisfies the axioms described in Chapter 3. As a result, all the
rules that allow one to manipulate probabilities also apply to conditional probabilities.
For example, since Property 3.1 must hold, it follows that P[A^c|B] = 1 − P[A|B]
(see also Problem 4.10). To prove that the axioms are satisfied for conditional
probabilities we first assume that the axioms hold for ordinary probabilities. Then,




Axiom 1

\[ P[A|B] = \frac{P[A \cap B]}{P[B]} \geq 0 \]

since P[A ∩ B] ≥ 0 and P[B] > 0.


Axiom 2

\[ P[S|B] = \frac{P[S \cap B]}{P[B]} = \frac{P[B]}{P[B]} = 1. \]


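Axiom 3 If A and C are mutually exclusive events, then

\begin{align*}
P[A \cup C|B] &= \frac{P[(A \cup C) \cap B]}{P[B]} && \text{(definition)} \\
 &= \frac{P[(A \cap B) \cup (C \cap B)]}{P[B]} && \text{(distributive property)} \\
 &= \frac{P[A \cap B] + P[C \cap B]}{P[B]} && \text{(Axiom 3 for ordinary probability, } A \cap C = \emptyset \Rightarrow (A \cap B) \cap (C \cap B) = \emptyset\text{)} \\
 &= P[A|B] + P[C|B] && \text{(definition of conditional probability).}
\end{align*}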


Conditional probabilities are useful in that they allow us to simplify probability
calculations. One particularly important relationship based on conditional probability
is described next. Consider a partitioning of the sample space S. Recall that
a partition is defined as a group of sets B_1, B_2, ..., B_N such that S = ∪_{i=1}^{N} B_i and
B_i ∩ B_j = ∅ for i ≠ j. Then we can rewrite the probability P[A] as

\[ P[A] = P[A \cap S] = P\left[A \cap \left(\cup_{i=1}^{N} B_i\right)\right]. \]

But by a slight extension of the distributive property of sets, we can express this as

\[ P[A] = P[(A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_N)]. \]

Since the B_i’s are mutually exclusive, then so are the A ∩ B_i’s, and therefore

\[ P[A] = \sum_{i=1}^{N} P[A \cap B_i] \]

or finally

\[ P[A] = \sum_{i=1}^{N} P[A|B_i] P[B_i]. \qquad (4.4) \]


This  relationship  is  called  the  law  of  total  probability.  Its  utility  is  illustrated  next. 

Example  4.2  —  A  compound  experiment 

Two urns contain different proportions of red and black balls. Urn 1 has a proportion
p_1 of red balls and a proportion 1 − p_1 of black balls whereas urn 2 has
proportions of p_2 and 1 − p_2 of red balls and black balls, respectively. A compound
experiment is performed in which an urn is chosen at random, followed by the
selection of a ball. We would like to find the probability that a red ball is selected.
To do so we use (4.4) with A = {red ball selected}, B_1 = {urn 1 chosen}, and
B_2 = {urn 2 chosen}. Then

\begin{align*}
P[\text{red ball selected}] &= P[\text{red ball selected} \mid \text{urn 1 chosen}]\, P[\text{urn 1 chosen}] \\
 &\quad + P[\text{red ball selected} \mid \text{urn 2 chosen}]\, P[\text{urn 2 chosen}] \\
 &= p_1 \tfrac{1}{2} + p_2 \tfrac{1}{2} = \tfrac{1}{2}(p_1 + p_2).
\end{align*}


Do B_1 and B_2 really partition the sample space?

To verify that the application of the law of total probability is indeed valid for this
problem, we need to show that B_1 ∪ B_2 = S and B_1 ∩ B_2 = ∅. In our description
of B_1 and B_2 we refer to the choice of an urn. In actuality, this is shorthand for all
the balls in the urn. If urn 1 contains balls numbered 1 to N_1, then by choosing urn
1 we are really saying that the event is that one of the balls numbered 1 to N_1 is
chosen and similarly for urn 2 being chosen. Hence, since the sample space consists
of all the numbered balls in urns 1 and 2, it is observed that the union of B_1 and
B_2 is the set of all possible outcomes or the sample space. Also, B_1 and B_2 are
mutually exclusive since we choose urn 1 or urn 2 but not both.

⚠
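
As a numerical companion to Example 4.2, the sketch below evaluates the law of total probability for the two-urn experiment and compares it with a simulation. It is only an illustration: the proportions p_1 = 1/5 and p_2 = 7/10, the function names, and the seed are hypothetical choices, not values from the text.

```python
from fractions import Fraction
import random


def total_probability(cond_probs, partition_probs):
    """Return P[A] = sum_i P[A|B_i] P[B_i] for a partition B_1, ..., B_N."""
    if abs(sum(partition_probs) - 1) > 1e-12:
        raise ValueError("partition probabilities must sum to one")
    return sum(pa * pb for pa, pb in zip(cond_probs, partition_probs))


def simulate_red_fraction(p1, p2, num_trials, seed=0):
    """Choose an urn with probability 1/2 each, then draw one ball."""
    rng = random.Random(seed)
    reds = 0
    for _ in range(num_trials):
        p_red = p1 if rng.random() < 0.5 else p2  # urn 1 or urn 2
        reds += rng.random() < p_red              # True counts as 1
    return reds / num_trials


# Hypothetical red-ball proportions for urns 1 and 2.
p1, p2 = Fraction(1, 5), Fraction(7, 10)
half = Fraction(1, 2)

p_red_exact = total_probability([p1, p2], [half, half])
p_red_sim = simulate_red_fraction(float(p1), float(p2), 100_000)
print(p_red_exact, round(p_red_sim, 3))
```

The exact value agrees with (p_1 + p_2)/2, and the simulated fraction of red balls should be close to it for a large number of trials.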

Some  more  examples  follow. 

Example  4.3  -  Probability  of  error  in  a  digital  communication  system 

In a digital communication system a “0” or “1” is transmitted to a receiver. Typically,
either bit is equally likely to occur so that a prior probability of 1/2 is assumed.
At the receiver a decoding error can be made due to channel noise, so that a 0 may
be mistaken for a 1 and vice versa. Defining the probability of decoding a 1 when a
0 is transmitted as ε and a 0 when a 1 is transmitted also as ε, we are interested in
the overall probability of an error. A probabilistic model summarizing the relevant
features is shown in Figure 4.3. Note that the problem at hand is essentially the
same as the previous one. If urn 1 is chosen, then we transmit a 0 and if urn 2
is chosen, we transmit a 1. The effect of the channel is to introduce an error so
that even if we know which bit was transmitted, we do not know the received bit.
This is analogous to not knowing which ball was chosen from the given urn. The
probability of error is from (4.4)

\begin{align*}
P[\text{error}] &= P[\text{error} \mid 0 \text{ transmitted}]\, P[0 \text{ transmitted}] \\
 &\quad + P[\text{error} \mid 1 \text{ transmitted}]\, P[1 \text{ transmitted}] \\
 &= \epsilon \tfrac{1}{2} + \epsilon \tfrac{1}{2} = \epsilon.
\end{align*}

Figure 4.3: Probabilistic model of a digital communication system. (A 0 or 1 is chosen
with P[0] = P[1] = 1/2, transmitted, and received.)

❖
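
The channel model of Example 4.3 is easy to simulate. The following sketch assumes a hypothetical error probability of ε = 0.1; the function names, the seed, and the number of bits are likewise illustrative assumptions. Each bit is chosen with P[0] = P[1] = 1/2 and is flipped with probability ε, and the observed error rate is compared with the value given by the law of total probability.

```python
import random


def error_probability(epsilon, p_zero=0.5):
    """P[error] = P[error|0] P[0] + P[error|1] P[1] with both conditionals equal to epsilon."""
    return epsilon * p_zero + epsilon * (1 - p_zero)


def simulate_error_rate(epsilon, num_bits, p_zero=0.5, seed=0):
    """Transmit random bits through a channel that flips each bit with probability epsilon."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(num_bits):
        sent = 0 if rng.random() < p_zero else 1
        flipped = rng.random() < epsilon
        received = 1 - sent if flipped else sent
        errors += received != sent
    return errors / num_bits


EPSILON = 0.1  # hypothetical channel error probability
p_error_exact = error_probability(EPSILON)
p_error_sim = simulate_error_rate(EPSILON, 100_000)
print(p_error_exact, round(p_error_sim, 3))
```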

Conditional probabilities can be quite tricky, in that they sometimes produce
counterintuitive results. A famous instance of this is the Monty Hall or Let’s Make a
Deal problem.

Example  4.4  —  Monty  Hall  problem 

About 40 years ago there was a television game show called “Let’s Make a Deal”.
The game show host, Monty Hall, would present the contestant with three closed
doors. Behind one door was a new car, while the others concealed less desirable
prizes, for instance, farm animals. The contestant would first have the opportunity
to choose a door, but it would not be opened. Monty would then choose one of the
remaining doors and open it. Since he would have knowledge of which door led to
the car, he would always choose a door to reveal one of the farm animals. Hence,
if the contestant had chosen one of the farm animals, Monty would then choose the
door that concealed the other farm animal. If the contestant had chosen the door
behind which was the car, then Monty would choose one of the other doors, both
concealing farm animals, at random. At this point in the game, the contestant was
faced with two closed doors, one of which led to the car and the other to a farm
animal. The contestant was given the option of either opening the door she had
originally chosen or deciding to open the other door. What should she do? The
answer, surprisingly, is that by choosing to switch doors she has a probability of 2/3
of winning the car! If she stays with her original choice, then the probability is only
1/3. Most people would say that regardless of which strategy she decided upon,
her probability of winning the car is 1/2.
                  M_j
              1      2      3

        1     0     1/6    1/6
C_i     2     0      0     1/3*
        3     0     1/3*    0

Table 4.2: Joint probabilities (P[C_i, M_j] = P[M_j|C_i]P[C_i]) for contestant’s initial
and Monty’s choice of doors. Winning door is 1.


To see how these probabilities are determined, first assume she stays with her
original choice. Then, since the car is equally likely to be placed behind any
of the three doors, the probability of the contestant’s winning the car is 1/3.
Monty’s choice of a door is irrelevant since her final choice is always the same
as her initial choice. However, if as a result of Monty’s action a different door
is selected by the contestant, then the probability of winning becomes a conditional
probability. We now compute this by assuming that the car is behind door
one. Define the events C_i = {contestant initially chooses door i} for i = 1,2,3 and
M_j = {Monty opens door j} for j = 1,2,3. Next we determine the joint probabilities
P[C_i, M_j] by using

\[ P[C_i, M_j] = P[M_j|C_i]P[C_i]. \]

Since the winning door is never chosen by Monty, we have P[M_1|C_i] = 0. Also,
Monty never opens the door initially chosen by the contestant so that P[M_i|C_i] = 0.
Then, it is easily verified that

\[ P[M_2|C_3] = P[M_3|C_2] = 1 \quad \text{(contestant chooses losing door)} \]
\[ P[M_3|C_1] = P[M_2|C_1] = \frac{1}{2} \quad \text{(contestant chooses winning door)} \]

and P[C_i] = 1/3. The joint probabilities are summarized in Table 4.2. Since
the contestant always switches doors, the winning events are (2,3) (the contestant
initially chooses door 2 and Monty chooses door 3) and (3,2) (the contestant initially
chooses door 3 and Monty chooses door 2). As shown in Table 4.2 (the entries with
asterisks), the total probability is 2/3. This may be verified directly using

\begin{align*}
P[\text{final choice is door 1}] &= P[M_3|C_2]P[C_2] + P[M_2|C_3]P[C_3] \\
 &= P[C_2, M_3] + P[C_3, M_2].
\end{align*}

Alternatively, the only way she can lose is if she initially chooses door one since she
always switches doors. This has a probability of 1/3 and hence her probability of
winning is 2/3. In effect, Monty, by eliminating a door, has improved her odds!
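
Because the Monty Hall result is counterintuitive, it is worth checking both exactly and by simulation. The Python sketch below is an illustration under the assumptions stated in the example (contestant chooses uniformly, Monty never opens the chosen door or the winning door, and picks at random when he has two options); the function names and the seed are hypothetical.

```python
from fractions import Fraction
import random

DOORS = (1, 2, 3)


def joint_table(winning_door=1):
    """P[C_i, M_j] = P[M_j|C_i] P[C_i] with P[C_i] = 1/3."""
    table = {}
    for c in DOORS:
        allowed = [d for d in DOORS if d != c and d != winning_door]
        for m in DOORS:
            p_m_given_c = Fraction(1, len(allowed)) if m in allowed else Fraction(0)
            table[(c, m)] = p_m_given_c * Fraction(1, 3)
    return table


def switch_win_probability(table, winning_door=1):
    """Sum the joint probabilities for which switching lands on the winning door."""
    total = Fraction(0)
    for (c, m), p in table.items():
        if p == 0:
            continue
        final = (set(DOORS) - {c, m}).pop()  # the one remaining closed door
        if final == winning_door:
            total += p
    return total


def simulate(num_games, switch, seed=0):
    """Fraction of games won when the car is placed at random."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(num_games):
        car = rng.randint(1, 3)
        choice = rng.randint(1, 3)
        opened = rng.choice([d for d in DOORS if d != choice and d != car])
        if switch:
            choice = (set(DOORS) - {choice, opened}).pop()
        wins += choice == car
    return wins / num_games


table = joint_table()
p_switch = switch_win_probability(table)
print(p_switch, round(simulate(50_000, True), 3), round(simulate(50_000, False), 3))
```

The dictionary returned by `joint_table` reproduces the joint probabilities of Table 4.2 for the case in which the winning door is 1.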
4.4 Statistically Independent Events


Two events A and B are said to be statistically independent (or sometimes just
independent) if P[A|B] = P[A]. If this is true, then

\[ P[A|B] = \frac{P[A \cap B]}{P[B]} = P[A] \]

which results in the condition for statistical independence of

\[ P[A \cap B] = P[A]P[B]. \qquad (4.5) \]

An example is shown in Figure 4.2c. There, the probability of A is unchanged if we
know that the outcome is contained in the event B. Note, however, that once we
know that B has occurred, the outcome could not have been in the uncross-hatched
region of A but must be in the cross-hatched region. Knowing that B has occurred
does in fact affect the possible outcomes. However, it is the ratio of P[A ∩ B] to
P[B] that remains the same.


Example  4.5  —  Statistical  independence  does  not  mean  one  event  does 
not  affect  another  event. 

If a fair die is tossed, the probability of a 2 or a 3 is P[A = {2,3}] = 1/3. Now
assume we know that the outcome is an even number or B = {2,4,6}. Recomputing
the probability

\[ P[A|B] = \frac{P[A \cap B]}{P[B]} = \frac{P[\{2\}]}{P[\{2,4,6\}]} = \frac{1/6}{3/6} = \frac{1}{3} = P[A]. \]

Hence, A and B are independent. Yet, knowledge of B occurring has affected the
possible outcomes. In particular, the event A ∩ B = {2} has half as many elements
as A, but the reduced sample space S' = B also has half as many elements as S.

❖ 
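
Condition (4.5) is straightforward to test for events on a finite sample space with equally likely outcomes. The sketch below does so for the fair die using exact fractions; the helper names `prob` and `is_independent` are illustrative assumptions.

```python
from fractions import Fraction

SAMPLE_SPACE = frozenset(range(1, 7))  # fair die, equally likely outcomes


def prob(event):
    """P[E] for an event E on the fair-die sample space."""
    return Fraction(len(set(event) & SAMPLE_SPACE), len(SAMPLE_SPACE))


def is_independent(a, b):
    """Check the product condition P[A and B] = P[A] P[B]."""
    return prob(set(a) & set(b)) == prob(a) * prob(b)


A = {2, 3}
B = {2, 4, 6}
print(prob(A & B) / prob(B), prob(A), is_independent(A, B))

# A = {4} is not independent of B: knowing the outcome is even changes P[A].
print(is_independent({4}, B))
```

The first line of output confirms P[A|B] = P[A] for the events of Example 4.5, while the second shows a pair of events that fails the product condition.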

The condition for the event A to be independent of the event B is P[A ∩ B] =
P[A]P[B]. Hence, we need only know the marginal probabilities P[A] and P[B] to
determine the joint probability P[A ∩ B]. In practice, this property turns out to be
very useful. Finally, it is important to observe that statistical independence has a
symmetry property, as we might expect. If A is independent of B, then B must be
independent of A since

\begin{align*}
P[B|A] &= \frac{P[B \cap A]}{P[A]} && \text{(definition)} \\
 &= \frac{P[A \cap B]}{P[A]} && \text{(commutative property)} \\
 &= \frac{P[A]P[B]}{P[A]} && \text{(A is independent of B)} \\
 &= P[B]
\end{align*}

and therefore B is independent of A. Henceforth, we can say that the events A and
B are statistically independent of each other, without further elaboration.


⚠

Statistically independent events are different from mutually exclusive events.

If A and B are mutually exclusive and B occurs, then A cannot occur. Thus,
P[A|B] = 0. If A and B are statistically independent and B occurs, then P[A|B] =
P[A]. Clearly, the two values of P[A|B] are only the same if P[A] = 0. In general
then, the conditions of mutual exclusivity and independence must be different
since they lead to different values of P[A|B]. A specific example of events that
are both mutually exclusive and statistically independent is shown in Figure 4.4.

Figure 4.4: Events that are mutually exclusive (since A ∩ B = ∅) and independent
(since P[A ∩ B] = P[∅] = 0 and P[A]P[B] = 0 · P[B] = 0).

Finally, the two conditions produce different relationships, namely

\[ P[A \cup B] = P[A] + P[B] \quad \text{mutually exclusive events} \]
\[ P[A \cap B] = P[A]P[B] \quad \text{statistically independent events.} \]

See also Figure 4.2c for statistically independent but not mutually exclusive events.
Can you think of a case of mutually exclusive but not independent events?

⚠

Consider now the extension of the idea of statistical independence to three events.
Three events are defined to be independent if the knowledge that any one or two
of the events has occurred does not affect the probability of the third event. For
example, one condition is that P[A|B ∩ C] = P[A]. We will use the shorthand
notation P[A|B,C] to indicate that this is the probability of A given that B and
C have occurred. Note that if B and C have occurred, then by definition B ∩ C has
occurred. The full set of conditions is

\[ P[A|B] = P[A|C] = P[A|B,C] = P[A] \]
\[ P[B|A] = P[B|C] = P[B|A,C] = P[B] \]
\[ P[C|A] = P[C|B] = P[C|A,B] = P[C]. \]
These conditions are satisfied if and only if

\begin{align*}
P[AB] &= P[A]P[B] && (4.6) \\
P[AC] &= P[A]P[C] && (4.7) \\
P[BC] &= P[B]P[C] && (4.8) \\
P[ABC] &= P[A]P[B]P[C]. && (4.9)
\end{align*}

If  the  first  three  conditions  (4.6)-(4.8)  are  satisfied,  then  the  events  are  said  to  be 
pairwise  independent.  They  are  not  enough,  however,  to  ensure  independence.  The 
last  condition  (4.9)  is  also  required  since  without  it  we  could  not  assert  that 


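\begin{align*}
P[A|B,C] &= P[A|BC] && \text{(definition of B and C occurring)} \\
 &= \frac{P[ABC]}{P[BC]} && \text{(definition of conditional probability)} \\
 &= \frac{P[ABC]}{P[B]P[C]} && \text{(from (4.8))} \\
 &= \frac{P[A]P[B]P[C]}{P[B]P[C]} && \text{(from (4.9))} \\
 &= P[A]
\end{align*}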

and similarly for the other conditions (see also Problem 4.20 for an example). In
general, events E_1, ..., E_N are defined to be statistically independent if

\begin{align*}
P[E_i E_j] &= P[E_i]P[E_j] && i \neq j \\
P[E_i E_j E_k] &= P[E_i]P[E_j]P[E_k] && i \neq j \neq k \\
 &\;\;\vdots \\
P[E_1 E_2 \cdots E_N] &= P[E_1]P[E_2] \cdots P[E_N].
\end{align*}
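
To see why pairwise independence is not enough, consider the following sketch. The example is a standard one chosen here for illustration and is not taken from the text: two fair coin tosses with A = {head on toss 1}, B = {head on toss 2}, and C = {both tosses agree}. The helper names are assumptions. The code checks the product condition for every subset of two or more events.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

# Sample space of two fair coin tosses, all four outcomes equally likely.
SAMPLE_SPACE = set(product("HT", repeat=2))


def prob(event):
    return Fraction(len(event), len(SAMPLE_SPACE))


def product_rule_holds(events):
    """True if the probability of the intersection equals the product of the probabilities."""
    joint = set.intersection(*events)
    return prob(joint) == prod(prob(e) for e in events)


def pairwise_independent(events):
    return all(product_rule_holds(pair) for pair in combinations(events, 2))


def mutually_independent(events):
    """Check the product rule for every subset of size 2, 3, ..., N."""
    return all(
        product_rule_holds(subset)
        for size in range(2, len(events) + 1)
        for subset in combinations(events, size)
    )


A = {o for o in SAMPLE_SPACE if o[0] == "H"}   # head on toss 1
B = {o for o in SAMPLE_SPACE if o[1] == "H"}   # head on toss 2
C = {o for o in SAMPLE_SPACE if o[0] == o[1]}  # both tosses agree

print(pairwise_independent([A, B, C]), mutually_independent([A, B, C]))
```

Here each pair satisfies the product condition, but P[ABC] = 1/4 while P[A]P[B]P[C] = 1/8, so the three events are not independent.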

Although  statistically  independent  events  allow  us  to  compute  joint  probabilities 
based  on  only  the  marginal  probabilities,  we  can  still  determine  joint  probabilities 
without  this  property.  Of  course,  it  becomes  much  more  difficult.  Consider  three 
events  as  an  example.  Then,  the  joint  probability  is 

P[ABC]  =  P[A\B,  C]P[BC] 

=  P[A\B,C]P[B\C]P[C].  (4.10) 

This  relationship  is  called  the  probability  chain  rule.  One  is  required  to  determine 
conditional  probabilities,  not  always  an  easy  matter.  A  simple  example  follows. 
Example 4.6 - Tossing a fair die - once again

If  we  toss  a  fair  die,  then  it  is  clear  that  the  probability  of  the  outcome  being  4  is 
1/6.  We  can,  however,  rederive  this  result  by  using  (4.10).  Letting 

A  =  {even  number}  =  {2,4, 6} 

B  =  {numbers  >  2}  =  {3, 4, 5, 6} 

C  =  {numbers  <  5}  =  {1, 2, 3, 4} 

we have that ABC = {4}. These events can be shown to be dependent (see Problem
4.21). Now making use of (4.10) and noting that BC = {3,4} it follows that

\[ P[ABC] = P[A|B,C]P[B|C]P[C] = \left(\frac{1/6}{2/6}\right)\left(\frac{2/6}{4/6}\right)\left(\frac{4}{6}\right) = \frac{1}{6}. \]

❖
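
The chain rule calculation of Example 4.6 can be verified with exact arithmetic. In this sketch the helper names `prob` and `cond_prob` are illustrative assumptions; the events A, B, and C are those defined in the example.

```python
from fractions import Fraction

S = set(range(1, 7))  # fair die


def prob(event):
    return Fraction(len(event), len(S))


def cond_prob(a, b):
    """P[A|B] = P[A and B] / P[B]."""
    return prob(a & b) / prob(b)


A = {2, 4, 6}      # even number
B = {3, 4, 5, 6}   # numbers greater than 2
C = {1, 2, 3, 4}   # numbers less than 5

chain = cond_prob(A, B & C) * cond_prob(B, C) * prob(C)  # right side of (4.10)
direct = prob(A & B & C)                                 # P[{4}]
naive = prob(A) * prob(B) * prob(C)                      # valid only if independent

print(chain, direct, naive)
```

The chain rule value matches the direct probability of {4}, whereas the product of the marginals does not, which is consistent with the events being dependent.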


4.5  Bayes’  Theorem 


The definition of conditional probability leads to a famous and sometimes controversial
formula for computing conditional probabilities. Recalling the definition, we
have that

\[ P[A|B] = \frac{P[AB]}{P[B]} \qquad (4.11) \]

and

\[ P[B|A] = \frac{P[AB]}{P[A]}. \qquad (4.12) \]

Upon substitution of P[AB] from (4.11) into (4.12)

\[ P[B|A] = \frac{P[A|B]P[B]}{P[A]}. \qquad (4.13) \]

This is called Bayes’ theorem. By knowing the marginal probabilities P[A], P[B]
and the conditional probability P[A|B], we can determine the other conditional
probability P[B|A]. The theorem allows us to perform “inference” or to assess
(with some probability) the validity of an event when some other event has been
observed. For example, if an urn containing an unknown composition of balls is
sampled with replacement and produces an outcome of 10 red balls, what are we to
make of this? One might conclude that the urn contains only red balls. Yet, another
individual might claim that the urn is a “fair” one, containing half red balls and
half black balls, and attribute the outcome to luck. To test the latter conjecture we
now determine the probability of a fair urn given that 10 red balls have just been
drawn. The reader should note that we are essentially going “backwards” - usually
we compute the probability of choosing 10 red balls given a fair urn. Now we are
given the outcomes and wish to determine the probability of a fair urn. In doing so
we believe that the urn is fair with probability 0.9. This is due to our past experience
with our purchases from urn.com. In effect, we assume that the prior probability of
B = {fair urn} is P[B] = 0.9. If A = {10 red balls drawn}, we wish to determine
P[B|A], which is the probability of the urn being fair after the experiment has been
performed or the posterior probability. This probability is our reassessment of the
fair urn in light of the new evidence (10 red balls drawn). Let’s compute P[B|A]
which according to (4.13) requires knowledge of the prior probability P[B] and the
conditional probability P[A|B]. The former was assumed to be 0.9 and the latter is
the probability of drawing 10 successive red balls from an urn with p = 1/2. From
our previous work this is given by the binomial law as

\[ P[A|B] = P[k = 10] = \binom{M}{k} p^k (1-p)^{M-k} = \binom{10}{10} \left(\frac{1}{2}\right)^{10} \left(\frac{1}{2}\right)^{0} = \left(\frac{1}{2}\right)^{10}. \]

We still need to find P[A]. But this is easily found using the law of total probability
as

\begin{align*}
P[A] &= P[A|B]P[B] + P[A|B^c]P[B^c] \\
 &= P[A|B]P[B] + P[A|B^c](1 - P[B])
\end{align*}

and thus only P[A|B^c] needs to be determined (which is not equal to 1 − P[A|B],
as is shown in Problem 4.9). This is the conditional probability of drawing 10 red
balls from an unfair urn. For simplicity we will assume that an unfair urn has all red
balls and thus P[A|B^c] = 1. Now we have that

\[ P[A] = \left(\frac{1}{2}\right)^{10}(0.9) + (1)(0.1) \]

and using this in (4.13) yields

\[ P[B|A] = \frac{\left(\frac{1}{2}\right)^{10}(0.9)}{\left(\frac{1}{2}\right)^{10}(0.9) + (1)(0.1)} = 0.0087. \]


The  posterior  probability  (after  10  red  balls  have  been  drawn)  that  the  urn  is  fair 
is  only  0.0087.  Our  conclusion  would  be  to  reject  the  assumption  of  a  fair  urn. 

Another way to quantify the result is to compare the posterior probability of the
unfair urn to the posterior probability of the fair urn by the ratio of the former to
the latter. This is called the odds ratio and it is interpreted as the odds against the
hypothesis of a fair urn. In this case it is

\[ \text{odds} = \frac{P[B^c|A]}{P[B|A]} = \frac{1 - 0.0087}{0.0087} \approx 114. \]
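
The posterior calculation for the urn can be written as a short function, which also makes it easy to see how the posterior changes with the number of red balls drawn. This is a sketch under the assumptions made in the text (prior 0.9 for the fair urn, an unfair urn containing only red balls); the function and parameter names are illustrative.

```python
def posterior_fair(num_red, prior_fair=0.9, p_red_fair=0.5, p_red_unfair=1.0):
    """P[fair urn | num_red red balls drawn in num_red draws] via Bayes' theorem (4.13)."""
    like_fair = p_red_fair ** num_red        # P[A|B]
    like_unfair = p_red_unfair ** num_red    # P[A|B^c]
    evidence = like_fair * prior_fair + like_unfair * (1 - prior_fair)  # P[A]
    return like_fair * prior_fair / evidence


post = posterior_fair(10)
odds_against = (1 - post) / post
print(round(post, 4), round(odds_against, 1))

# The posterior shrinks as more consecutive red balls are observed.
for n in (0, 1, 5, 10):
    print(n, round(posterior_fair(n), 4))
```

With no draws the posterior equals the prior, and each additional red ball halves the likelihood of the fair-urn hypothesis relative to the all-red urn.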
It is seen from this example that based on observed “data”, prior beliefs embodied
in P[B] = 0.9 can be modified to yield posterior beliefs or P[B|A] = 0.0087. This
is an important concept in statistical inference [Press 2003].

In the previous example, we used the law of total probability to determine the
posterior probability. More generally, if a set of B_i’s partition the sample space,
then Bayes’ theorem can be expressed as

\[ P[B_k|A] = \frac{P[A|B_k]P[B_k]}{\sum_{i=1}^{N} P[A|B_i]P[B_i]} \qquad k = 1,2,\ldots,N. \qquad (4.14) \]

The denominator in (4.14) serves to normalize the posterior probability so that the
conditional probabilities sum to one or

\[ \sum_{k=1}^{N} P[B_k|A] = 1. \]


In  many  problems  one  is  interested  in  determining  whether  an  observed  event 
or  effect  is  the  result  of  some  cause.  Again  the  backwards  or  inferential  reasoning 
is  implicit.  Bayes’  theorem  can  be  used  to  quantify  this  connection  as  illustrated 
next. 

Example  4.7  -  Medical  diagnosis 

Suppose  it  is  known  that  0.001%  of  the  general  population  has  a  certain  type  of 
cancer.  A  patient  visits  a  doctor  complaining  of  symptoms  that  might  indicate  the 
presence  of  this  cancer.  The  doctor  performs  a  blood  test  that  will  confirm  the 
cancer  with  a  probability  of  0.99  if  the  patient  does  indeed  have  cancer.  However, 
the  test  also  produces  false  positives  or  says  a  person  has  cancer  when  he  does  not. 
This  occurs  with  a  probability  of  0.2.  If  the  test  comes  back  positive,  what  is  the 
probability  that  the  person  has  cancer? 

To solve this problem we let B = {person has cancer}, the causative event, and
A = {test is positive}, the effect of that event. Then, the desired probability is

\begin{align*}
P[B|A] &= \frac{P[A|B]P[B]}{P[A|B]P[B] + P[A|B^c]P[B^c]} \\
 &= \frac{(0.99)(0.00001)}{(0.99)(0.00001) + (0.2)(0.99999)} = 4.95 \times 10^{-5}.
\end{align*}

The prior probability of the person having cancer is P[B] = 10^{-5} while the posterior
probability of the person having cancer (after the test is performed and found to
be positive) is P[B|A] = 4.95 × 10^{-5}. With these results the doctor might be hard
pressed to order additional tests. This is quite surprising, and is due to the prior
probability assumed, which is quite small and therefore tends to nullify the test
results. If we had assumed that P[B] = 0.5, for indeed the doctor is seeing a patient
who is complaining of symptoms consistent with cancer and not some person chosen
at random from the general population, then

\[ P[B|A] = \frac{(0.99)(0.5)}{(0.99)(0.5) + (0.2)(0.5)} = 0.83 \]
which  seems  more  reasonable  (see  also  Problem  4.23).  The  controversy  surrounding 
the  use  of  Bayes’  theorem  in  probability  calculations  can  almost  always  be  traced 
back  to  the  prior  probability  assumption.  Bayes’  theorem  is  mathematically  correct 
-  only  its  application  is  sometimes  in  doubt! 

❖
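
The sensitivity of the diagnosis to the prior can be explored with a small implementation of the partition form of Bayes’ theorem (4.14). The following is a minimal sketch; the function name `bayes_posterior` and the constant names are assumptions made for illustration, while the numerical values are those of Example 4.7.

```python
def bayes_posterior(likelihoods, priors):
    """Return [P[B_k|A]] for a partition B_1..B_N given P[A|B_k] and P[B_k]."""
    joint = [like * prior for like, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # P[A] by the law of total probability
    if evidence == 0:
        raise ValueError("P[A] is zero, so the posterior is undefined")
    return [j / evidence for j in joint]


P_POS_GIVEN_CANCER = 0.99   # P[A|B]
P_POS_GIVEN_HEALTHY = 0.2   # P[A|B^c], the false positive probability

for prior in (1e-5, 0.5):
    posterior = bayes_posterior(
        [P_POS_GIVEN_CANCER, P_POS_GIVEN_HEALTHY],
        [prior, 1 - prior],
    )
    print(prior, posterior[0])
```

The two entries of each returned list are the posterior probabilities of B and of its complement, and they sum to one because of the normalizing denominator.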


4.6  Multiple  Experiments 

4.6.1  Independent  Subexperiments 

An experiment that was discussed in Chapter 1 was the repeated tossing of a coin.
We can alternatively view this experiment as a succession of subexperiments, with
each subexperiment being a single toss of the coin. It is of interest to investigate the
relationship between the probabilities defined on the experiment and those defined
on the subexperiments. To be more concrete, assume a coin is tossed twice in
succession and we wish to determine the probability of the event A = {(H,T)}.
Recall that the notation (H,T) denotes an ordered 2-tuple and represents a head
on toss 1 and a tail on toss 2. For a fair coin it was determined to be 1/4 since
we assumed that all 4 possible outcomes were equally likely. This seemed like a
reasonable assumption. However, if the coin had a probability of heads of 0.99, we
might not have been so quick to agree with the equally likely assumption. How
then are we to determine the probabilities? Let’s first consider the experiment to
be composed of two separate subexperiments with each subexperiment having a
sample space S^1 = {H, T}. The sample space of the overall experiment is obtained
by forming the cartesian product, which for this example is defined as

\begin{align*}
S &= S^1 \times S^1 \\
 &= \{(i,j) : i \in S^1, j \in S^1\} \\
 &= \{(H,H),(H,T),(T,H),(T,T)\}.
\end{align*}

It is formed by taking an outcome from S^1 for the first element of the 2-tuple and an
outcome from S^1 for the second element of the 2-tuple and doing this for all possible
outcomes. It would be exceedingly useful if we could determine probabilities for
events defined on S from those probabilities for events defined on S^1. In this way
the determination of probabilities of very complicated events could be simplified.
Such is the case if we assume that the subexperiments are independent. Continuing
on, we next calculate P[A] = P[(H,T)] for a coin with an arbitrary probability of
heads p. This event is defined on the sample space of 2-tuples, which is S. We can,
however, express it as an intersection

\begin{align*}
\{(H,T)\} &= \{(H,H),(H,T)\} \cap \{(H,T),(T,T)\} \\
 &= \{\text{heads on toss 1}\} \cap \{\text{tails on toss 2}\} \\
 &= H_1 \cap T_2.
\end{align*}

We would expect the events $H_1$ and $T_2$ to be independent of each other. Whether a
head or tail appears on the first toss should not affect the probability of the outcome
of the second toss and vice versa. Hence, we will let $P[(H,T)] = P[H_1]P[T_2]$ in
accordance with the definition of statistically independent events. We can determine
$P[H_1]$ either as $P[(H,H),(H,T)]$, which is defined on $S$, or equivalently due to the
independence assumption as $P[H]$, which is defined on $S^1$. Note that $P[H]$ is the
marginal probability and is equal to $P[(H,H)] + P[(H,T)]$. But the latter was
specified to be $p$ and therefore we have that

\[
\begin{aligned}
P[H_1] &= p \\
P[T_2] &= 1 - p
\end{aligned}
\]


and  finally, 

\[ P[(H,T)] = p(1-p). \]

For  a  fair  coin  we  recover  the  previous  value  of  1/4,  but  not  otherwise. 
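
The construction above can be sketched numerically. The following is a minimal illustration, assuming independent tosses; the function name `product_space_probabilities` and its arguments are hypothetical and not part of the text. It forms the cartesian product of the subexperiment sample spaces and assigns each 2-tuple the product of the marginal probabilities.

```python
from itertools import product


def product_space_probabilities(p_heads, n_tosses=2):
    """Assign a probability to every n-tuple of H/T outcomes.

    Assumes the tosses are independent subexperiments, each with
    P[H] = p_heads (an illustrative helper, not from the text).
    """
    marginal = {"H": p_heads, "T": 1.0 - p_heads}  # probabilities on S^1
    joint = {}
    # Cartesian product S^1 x S^1 x ... x S^1
    for outcome in product("HT", repeat=n_tosses):
        prob = 1.0
        for symbol in outcome:
            prob *= marginal[symbol]  # independence: multiply the marginals
        joint[outcome] = prob
    return joint


fair = product_space_probabilities(0.5)
biased = product_space_probabilities(0.99)
print(fair[("H", "T")])    # p(1 - p) with p = 1/2
print(biased[("H", "T")])  # p(1 - p) with p = 0.99
```

For the fair coin all four outcomes receive the same probability, while for p = 0.99 the outcome (H,T) is far less likely than (H,H), which is why the equally likely assumption fails there.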

Experiments  that  are  composed  of  subexperiments  whose  probabilities  of  the 
outcomes  do  not  depend  on  the  outcomes  of  any  of  the  other  subexperiments  are 
defined  to  be  independent  subexperiments.  Their  utility  is  to  allow  calculation  of  joint 
probabilities  from  marginal  probabilities.  More  generally,  if  we  have  M  independent 
subexperiments, with $A_i$ an event described for experiment $i$, then the joint event
$A = A_1 \cap A_2 \cap \cdots \cap A_M$ has probability

\[ P[A] = P[A_1]P[A_2] \cdots P[A_M]. \qquad (4.15) \]

Apart  from  the  differences  in  sample  spaces  upon  which  the  probabilities  are  defined, 
independence  of  subexperiments  is  equivalent  to  statistical  independence  of  events 
defined  on  the  same  sample  space. 

4.6.2  Bernoulli  Sequence 

The  single  tossing  of  a  coin  with  probability  p  of  heads  is  an  example  of  a  Bernoulli 
trial.  Consecutive  independent  Bernoulli  trials  comprise  a  Bernoulli  sequence .  More 
generally,  any  sequence  of  M  independent  subexperiments  with  each  subexperiment 
producing  two  possible  outcomes  is  called  a  Bernoulli  sequence.  Typically,  the 
subexperiment  outcomes  are  labeled  as  0  and  1  with  the  probability  of  a  1  being  p. 
Hence, for a Bernoulli trial $P[0] = 1 - p$ and $P[1] = p$. Several important probability
laws  are  based  on  this  model. 

Binomial Probability Law

Assume that M independent Bernoulli trials are carried out. We wish to determine
the probability of k 1's (or successes). Each outcome is an M-tuple and a successful
outcome would consist of k 1's and M - k 0's in any order. Thus, each successful
outcome has a probability of $p^k(1-p)^{M-k}$ due to independence. The total number
of successful outcomes is the number of ways k 1's may be placed in the M-tuple.
This is known from combinatorics to be $\binom{M}{k}$ (see Section 3.8). Hence, by summing
up the probabilities of the successful simple events, which are mutually exclusive,
we have

\[ P[k] = \binom{M}{k} p^k (1-p)^{M-k} \qquad k = 0, 1, \ldots, M \qquad (4.16) \]

which  we  immediately  recognize  as  the  binomial  probability  law.  We  have  previously 
encountered  the  same  law  when  we  chose  M  balls  at  random  from  an  urn  with 
replacement  and  desired  the  probability  of  obtaining  k  red  balls.  The  proportion  of 
red  balls  was  p.  In  that  case,  each  subexperiment  was  the  choosing  of  a  ball  and  all 
the  subexperiments  were  independent  of  each  other.  The  binomial  probabilities  are 
shown  in  Figure  4.5  for  various  values  of  p. 
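
As a quick numerical companion to the binomial law, the sketch below evaluates the probabilities for the two cases plotted in Figure 4.5. The helper name `binomial_pmf` is an assumption made for illustration only.

```python
from math import comb


def binomial_pmf(k, M, p):
    """P[k] = C(M, k) p^k (1 - p)^(M - k) for k successes in M independent trials."""
    return comb(M, k) * p**k * (1 - p) ** (M - k)


M = 10
for p in (0.5, 0.7):
    probs = [binomial_pmf(k, M, p) for k in range(M + 1)]
    most_likely = max(range(M + 1), key=lambda k: probs[k])
    # The probabilities over k = 0, ..., M should sum to one.
    print(f"p = {p}: most likely k = {most_likely}, total = {sum(probs):.6f}")
```

The most probable number of successes shifts to the right as p increases, which is the behavior the figure is meant to convey.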


[Figure 4.5: The binomial probability law for different values of p. Panels: (a) M = 10, p = 0.5; (b) M = 10, p = 0.7. Horizontal axis: k.]


Geometric  Probability  Law 

Another  important  aspect  of  a  Bernoulli  sequence  is  the  appearance  of  the  first 
success.  If  we  let  k  be  the  Bernoulli  trial  for  which  the  first  success  is  observed, 
then the event of interest is the simple event (f, f, ..., f, s), where s, f denote success
and failure, respectively. This is a k-tuple with the first k - 1 elements all f's. The
probability of the first success at trial k is therefore

\[ P[k] = (1-p)^{k-1} p \qquad k = 1, 2, \ldots \qquad (4.17) \]

where  0  <  p  <  1.  This  is  called  the  geometric  probability  law.  The  geometric 
probabilities  are  shown  in  Figure  4.6  for  various  values  of  p.  It  is  interesting  to  note 
that  the  first  success  is  always  most  likely  to  occur  on  the  first  trial  or  for  k  =  1. 
This  is  true  even  for  small  values  of  p,  which  is  somewhat  counterintuitive.  However, 
upon  further  reflection,  for  the  first  success  to  occur  on  trial  k  =  1  we  must  have 
a  success  on  trial  1  and  the  outcomes  of  the  remaining  trials  are  arbitrary.  For  a 
success on trial k = 2, for example, we must have a failure on trial 1 followed by a
success  on  trial  2,  with  the  remaining  outcomes  arbitrary.  This  additional  constraint 
reduces  the  probability.  It  will  be  seen  later,  though,  that  the  average  number  of 
trials required for a success is 1/p, which is more in line with our intuition. An
example of its use follows.

[Figure 4.6: The geometric probability law for different values of p. Panels: (a) p = 0.25; (b) p = 0.5.]

Example  4.8  -  Telephone  calling 

A  fax  machine  dials  a  phone  number  that  is  typically  busy  80%  of  the  time.  The 
machine  dials  it  every  5  minutes  until  the  line  is  clear  and  the  fax  is  able  to  be 
transmitted.  What  is  the  probability  that  the  fax  machine  will  have  to  dial  the 
number  9  times?  The  number  of  times  the  line  is  busy  can  be  considered  the  number 
of failures with each failure having a probability of 1 - p = 0.8. If the number is
dialed 9 times, then the first success occurs for k = 9 and

\[ P[9] = (0.8)^8 (0.2) = 0.0336. \]

A useful property of the geometric probability law is that it is memoryless. Assume
it is known that no successes occurred in the first m trials. Then, the probability of
the first success at trial m + l is the same as if we had started the Bernoulli sequence
experiment over again and determined the probability of the first success at trial l
(see  Problem  4.34). 
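
The geometric law and its memoryless property can be checked numerically. The sketch below is illustrative only: the names `geometric_pmf` and `prob_no_success_in` are hypothetical, and the values m = 5 and l = 3 are arbitrary choices. It reproduces the value of Example 4.8 and compares the conditional probability of a first success at trial m + l, given m initial failures, with the unconditional probability of a first success at trial l.

```python
def geometric_pmf(k, p):
    """P[k] = (1 - p)^(k - 1) p: first success occurs at trial k (k >= 1)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return (1 - p) ** (k - 1) * p


def prob_no_success_in(m, p):
    """Probability that the first m independent trials are all failures."""
    return (1 - p) ** m


p = 0.2  # probability that the line is clear (Example 4.8)
print(round(geometric_pmf(9, p), 4))  # 0.0336

# Memoryless check: condition on m failures, then ask for success at trial m + l.
m, l = 5, 3
conditional = geometric_pmf(m + l, p) / prob_no_success_in(m, p)
print(conditional, geometric_pmf(l, p))  # the two values agree up to rounding
```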

4.6.3  Multinomial  Probability  Law 

Consider an extension to the Bernoulli sequence in which the trials are still independent
but the outcomes for each trial may take on more than two values. For
example, let $S^1 = \{1, 2, 3\}$ and denote the probabilities of the outcomes 1, 2, and
3 by $p_1$, $p_2$, and $p_3$, respectively. As usual, the assignment of these probabilities
must satisfy $\sum_{i=1}^{3} p_i = 1$. Also, let the number of trials be M = 6 so that a possible
outcome might be (2, 1, 3, 1, 2, 2), whose probability is $p_2 p_1 p_3 p_1 p_2 p_2 = p_1^2 p_2^3 p_3$.
The multinomial probability law specifies the probability of obtaining $k_1$ 1's, $k_2$
2's, and $k_3$ 3's, where $k_1 + k_2 + k_3 = M = 6$. In the current example, $k_1 = 2$,
$k_2 = 3$, and $k_3 = 1$. Some outcomes with the same number of 1's, 2's, and 3's
are (2, 1, 3, 1, 2, 2), (1, 2, 3, 1, 2, 2), (1, 2, 1, 2, 2, 3), etc., with each outcome having a
probability of $p_1^2 p_2^3 p_3$. The total number of these outcomes will be the total number
of distinct 6-tuples that can be made with the numbers 1, 1, 2, 2, 2, 3. If the numbers
to be used were all different, then the total number of 6-tuples would be 6!, or all
permutations. However, since they are not, some of these permutations will be the
same. For example, we can arrange the 2's 3! ways and still have the same 6-tuple.
Likewise, the 1's can be arranged 2! ways without changing the 6-tuple. As a result,
the total number of distinct 6-tuples is


\[ \frac{6!}{2!\,3!\,1!} \qquad (4.18) \]

which is called the multinomial coefficient. (See also Problem 4.36 for another way
to derive this.) It is sometimes denoted by

\[ \binom{6}{2, 3, 1}. \]

Finally, for our example the probability of the sequence exhibiting two 1's, three
2's, and one 3 is

\[ \frac{6!}{2!\,3!\,1!}\, p_1^2 p_2^3 p_3^1. \]
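
The counting argument can be verified by brute force. The sketch below, which is only an illustration, generates all 6! orderings of the numbers 1, 1, 2, 2, 2, 3, keeps the distinct 6-tuples, and compares the count with the multinomial coefficient.

```python
from itertools import permutations
from math import factorial

digits = (1, 1, 2, 2, 2, 3)

# permutations() treats equal digits as distinguishable, so it yields 6! tuples;
# collecting them in a set removes the repeated 6-tuples.
all_orderings = list(permutations(digits))
distinct = set(all_orderings)

# Multinomial coefficient 6! / (2! 3! 1!)
count_formula = factorial(6) // (factorial(2) * factorial(3) * factorial(1))

print(len(all_orderings), len(distinct), count_formula)  # 720 60 60
```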

This  can  be  generalized  to  the  case  of  M  trials  with  N  possible  outcomes  for  each 
trial. The probability of $k_1$ 1's, $k_2$ 2's, ..., $k_N$ N's is

\[ P[k_1, k_2, \ldots, k_N] = \binom{M}{k_1, k_2, \ldots, k_N} p_1^{k_1} p_2^{k_2} \cdots p_N^{k_N} \qquad k_1 + k_2 + \cdots + k_N = M \qquad (4.19) \]

and where $\sum_{i=1}^{N} p_i = 1$. This is termed the multinomial probability law. Note that if
N = 2, then it reduces to the binomial law (see Problem 4.37). An example follows.


Example 4.9 - A version of scrabble

A person chooses 9 letters at random from the English alphabet with replacement.
What is the probability that she will be able to make the word "committee"? Here
we have that the outcome on each trial is one of 26 letters. To be able to make the
word she needs $k_c = 1$, $k_e = 2$, $k_i = 1$, $k_m = 2$, $k_o = 1$, $k_t = 2$, and $k_{\text{other}} = 0$. We
have denoted the outcomes as c, e, i, m, o, t, and "other". "Other" represents the
remaining 20 letters so that N = 7. Thus, the probability is from (4.19)

\[ P[k_c = 1, k_e = 2, k_i = 1, k_m = 2, k_o = 1, k_t = 2, k_{\text{other}} = 0] = \binom{9}{1,2,1,2,1,2,0} \left(\frac{1}{26}\right)^{9} \left(\frac{20}{26}\right)^{0} \]

since $p_c = p_e = p_i = p_m = p_o = p_t = 1/26$ and $p_{\text{other}} = 20/26$ due to the assumption
of "at random" sampling and with replacement. This becomes

\[ P[k_c = 1, k_e = 2, k_i = 1, k_m = 2, k_o = 1, k_t = 2, k_{\text{other}} = 0] = \frac{9!}{1!\,2!\,1!\,2!\,1!\,2!\,0!} \left(\frac{1}{26}\right)^{9} = 8.35 \times 10^{-9}. \]


❖ 
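
The calculation of Example 4.9 can be reproduced with a short routine for (4.19). This is a sketch under the example's assumptions (independent draws, equally likely letters); the function name `multinomial_pmf` is hypothetical.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of observing counts[i] occurrences of outcome i in
    M = sum(counts) independent trials with outcome probabilities probs[i]."""
    coeff = factorial(sum(counts))
    for k in counts:
        coeff //= factorial(k)  # each partial quotient is an exact integer
    return coeff * prod(p**k for p, k in zip(probs, counts))


# Outcomes ordered as c, e, i, m, o, t, "other"
counts = [1, 2, 1, 2, 1, 2, 0]
probs = [1 / 26] * 6 + [20 / 26]
print(f"{multinomial_pmf(counts, probs):.3g}")  # 8.35e-09
```

With only two outcomes the same routine returns binomial probabilities, for instance `multinomial_pmf([3, 7], [0.5, 0.5])` gives the binomial probability of 3 successes in 10 fair trials.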


4.6.4  Nonindependent  Subexperiments 

When  the  subexperiments  are  independent,  the  calculation  of  probabilities  can  be 
greatly simplified. The probability of an event that can be written as
$A = A_1 \cap A_2 \cap \cdots \cap A_M$ can be found via

\[ P[A] = P[A_1]P[A_2] \cdots P[A_M] \]

where each $P[A_i]$ can be found by considering only the individual subexperiment.
However, the assumption of independence can sometimes be unreasonable. In the
absence of independence, the probability would be found by using the chain rule
(see (4.10) for M = 3)

\[ P[A] = P[A_M | A_{M-1}, \ldots, A_1]\, P[A_{M-1} | A_{M-2}, \ldots, A_1] \cdots P[A_2 | A_1]\, P[A_1]. \qquad (4.20) \]

Such  would  be  the  case  if  a  Bernoulli  sequence  were  composed  of  nonindependent 
trials  as  illustrated  next. 

Example 4.10 - Dependent Bernoulli trials

Assume that we have two coins. One is fair and the other is weighted to have
a probability of heads of $p \neq 1/2$. We begin the experiment by first choosing at
random one of the two coins and then tossing it. If it comes up heads, we choose
the fair coin to use on the next trial. If it comes up tails, we choose the weighted
coin to use on the next trial. We repeat this procedure for all the succeeding trials.
One possible sequence of outcomes is shown in Figure 4.7a for the weighted coin
having p = 1/4. Also shown is the case when p = 1/2 or a fair coin is always used,
so that we are equally likely to observe a head or a tail on each trial. Note that in
the case of p = 1/4 (see Figure 4.7a), if the outcome is a tail on any trial, then we
use the weighted coin for the next trial. Since the weighted coin is biased towards
producing a tail, we would expect to again see a tail, and so on. This accounts for
the long run of tails observed. Clearly, the trials are not independent.

[Figure 4.7: Dependent Bernoulli sequence for different values of p. Panels: (a) M = 100, p = 0.25; (b) M = 100, p = 0.5.]

❖
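
A sequence like the one in Figure 4.7 can be generated with a few lines of code. The following is a minimal sketch of the two-coin procedure described in the example; the function names, the seed, and the use of Python's `random` module are illustrative assumptions and not the simulation used to produce the figure.

```python
import random


def simulate_dependent_bernoulli(num_trials, p_weighted, seed=None):
    """Simulate the two-coin experiment; 1 denotes a head, 0 a tail."""
    rng = random.Random(seed)
    # The first coin is chosen at random: fair (0.5) or weighted (p_weighted).
    p_heads = rng.choice([0.5, p_weighted])
    outcomes = []
    for _ in range(num_trials):
        outcome = 1 if rng.random() < p_heads else 0
        outcomes.append(outcome)
        # A head selects the fair coin for the next trial, a tail the weighted coin.
        p_heads = 0.5 if outcome == 1 else p_weighted
    return outcomes


def longest_run(outcomes, value):
    """Length of the longest run of consecutive entries equal to value."""
    best = current = 0
    for x in outcomes:
        current = current + 1 if x == value else 0
        best = max(best, current)
    return best


sequence = simulate_dependent_bernoulli(100, 0.25, seed=0)
print("".join(str(x) for x in sequence))
print("longest run of tails:", longest_run(sequence, 0))
```

Running it for several seeds with p = 0.25 and again with p = 0.5 should show, on average, longer runs of tails in the weighted case, in line with the discussion above.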

If we think some more about the previous experiment, we realize that the dependency
between trials is due only to the outcome of the (i - 1)st trial affecting the
outcome of the ith trial. In fact, once the coin has been chosen, the probabilities
for the next trial are either P[0] = P[1] = 1/2 if a head occurred on the previous
trial or P[0] = 3/4, P[1] = 1/4 if the previous trial produced a tail. The
previous outcome is called the state of the sequence. This behavior may be summarized
by the state probability diagram shown in Figure 4.8. The probabilities
shown are actually conditional probabilities. For example, 3/4 is the probability
P[tail on ith toss | tail on (i - 1)st toss] = P[0|0], and similarly for the others. This
type of Bernoulli sequence, in which the probabilities for trial i depend only on the
outcome of the previous trial, is called a Markov sequence. Mathematically, the
probability of the event $A_i$ on the ith trial given all the previous outcomes can be
written as

\[ P[A_i | A_{i-1}, A_{i-2}, \ldots, A_1] = P[A_i | A_{i-1}]. \]


[Figure 4.8: Markov state probability diagram.]


Using this in (4.20) produces

\[ P[A] = P[A_M | A_{M-1}]\, P[A_{M-1} | A_{M-2}] \cdots P[A_2 | A_1]\, P[A_1]. \qquad (4.21) \]


The conditional probabilities $P[A_i | A_{i-1}]$ are called the state transition probabilities,
and along with the initial probability $P[A_1]$, the probability of any joint event can
be determined. For example, we might wish to determine the probability of N = 10
tails in succession or of the event A = {(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)}. If the weighted
coin was actually fair, then $P[A] = (1/2)^{10} = 0.000976$, but if p = 1/4, we have by
letting $A_i = \{0\}$ for i = 1, 2, ..., 10 in (4.21)

\[ P[A] = \left( \prod_{i=2}^{10} P[A_i | A_{i-1}] \right) P[A_1]. \]


But $P[A_i | A_{i-1}] = P[0|0] = P[\text{tails} | \text{weighted coin}] = 3/4$ for i = 2, 3, ..., 10. Since
we initially choose one of the coins at random, we have

\[
\begin{aligned}
P[A_1] = P[0] &= P[\text{tail} | \text{weighted coin}]\, P[\text{weighted coin}] + P[\text{tail} | \text{fair coin}]\, P[\text{fair coin}] \\
              &= \frac{3}{4} \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2} = \frac{5}{8}.
\end{aligned}
\]

Thus, we have that

\[ P[A] = \left( \prod_{i=2}^{10} \frac{3}{4} \right) \frac{5}{8} = \frac{5}{8} \left( \frac{3}{4} \right)^{9} = 0.0469 \]

or about 48 times more probable than if the weighted coin were actually fair. Note
that we could also represent the process by using a trellis diagram as shown in Figure
4.9. The probability of any sequence is found by tracing the sequence values through
the trellis and multiplying the probabilities for each branch together, along with the
initial probability. Referring to Figure 4.9 the sequence 1,0,0 has a probability of
(3/8)(1/2)(3/4). The foregoing example is a simple case of a Markov chain. We will
study this modeling in much more detail in Chapter 22.

[Figure 4.9: Trellis diagram.]
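
Tracing a sequence through the trellis amounts to applying (4.21) directly. The sketch below is one way to do this, using exact fractions; the function name and the dictionary layout are assumptions made for illustration, while the probabilities are those of Example 4.10 with p = 1/4.

```python
from fractions import Fraction


def markov_sequence_probability(sequence, initial, transition):
    """Probability of a sequence for a Markov model, following (4.21).

    initial[s] is the probability of state s on the first trial and
    transition[prev][cur] is the probability of cur given prev.
    """
    prob = initial[sequence[0]]
    for prev, cur in zip(sequence, sequence[1:]):
        prob *= transition[prev][cur]  # one branch of the trellis
    return prob


# 0 = tail, 1 = head; weighted coin has probability of heads 1/4.
initial = {0: Fraction(5, 8), 1: Fraction(3, 8)}
transition = {
    0: {0: Fraction(3, 4), 1: Fraction(1, 4)},  # previous tail: weighted coin
    1: {0: Fraction(1, 2), 1: Fraction(1, 2)},  # previous head: fair coin
}

ten_tails = markov_sequence_probability([0] * 10, initial, transition)
print(float(ten_tails))                          # about 0.0469
print(float(ten_tails / Fraction(1, 2) ** 10))   # about 48
print(markov_sequence_probability([1, 0, 0], initial, transition))  # 9/64
```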

4.7  Real-World  Example  —  Cluster  Recognition 

In  many  areas  an  important  problem  is  the  detection  of  a  “cluster.”  Epidemiology 
is  concerned  with  the  incidence  of  a  greater  than  expected  number  of  disease  cases 
in  a  given  geographic  area.  If  such  a  situation  is  found  to  exist,  then  it  may  indicate 
a  problem  with  the  local  water  supply,  as  an  example.  Police  departments  may  wish 
to  focus  their  resources  on  areas  of  a  city  that  exhibit  an  unusually  high  incidence 
of  crime.  Portions  of  a  remotely  sensed  image  may  exhibit  an  increased  number  of 
noise  bursts.  This  could  be  due  to  a  group  of  sensors  that  are  driven  by  a  faulty 
power  source.  In  all  these  examples,  we  wish  to  determine  if  a  cluster  of  events 
has  occurred.  By  cluster,  we  mean  that  more  occurrences  of  an  event  are  observed 
than  would  normally  be  expected.  An  example  could  be  a  geographic  area  which 
is  divided  into  a  grid  of  50  x  50  cells  as  shown  in  Figure  4.10.  It  is  seen  that 
an  event  or  “hit”,  which  is  denoted  by  a  black  square,  occurs  rather  infrequently. 
In  this  example,  it  occurs  29/2500  =  1.16%  of  the  time.  Now  consider  Figure 
4.11. We see that the shaded area appears to exhibit more hits than the expected
number of 145 x 0.0116 = 1.68. One might be inclined to call this shaded area a cluster.
But  how  probable  is  this  cluster?  And  how  can  we  make  a  decision  to  either  accept 
the  hypothesis  that  this  area  is  a  cluster  or  to  reject  it?  To  arrive  at  a  decision  we 
use  a  Bayesian  approach.  It  computes  the  odds  ratio  against  the  occurrence  of  a 
cluster  (or  in  favor  of  no  cluster),  which  is  defined  as 

\[ \text{odds} = \frac{P[\text{no cluster} | \text{observed data}]}{P[\text{cluster} | \text{observed data}]}. \]

If  this  number  is  large,  typically  much  greater  than  one,  we  would  be  inclined  to 
reject the hypothesis of a cluster, and otherwise, to accept it. We can use Bayes' theorem
to evaluate the odds ratio by letting B = {cluster} and A = {observed data}.
Then,

\[ \text{odds} = \frac{P[B^c | A]}{P[B | A]} = \frac{P[A | B^c]\, P[B^c]}{P[A | B]\, P[B]}. \]

[Figure 4.10: Geographic area with incidents shown as black squares - no cluster present.]


Note that P[A] is not needed since it cancels out in the ratio. To evaluate this we
need to determine $P[B]$, $P[A | B^c]$, and $P[A | B]$. The first probability $P[B]$ is the prior
probability of a cluster. Since we believe a cluster is quite unlikely, we assign a
probability of $10^{-6}$ to this. Next we need $P[A | B^c]$ or the probability of the observed
data  if  there  is  no  cluster.  Since  each  cell  can  take  on  only  one  of  two  values, 
either  a  hit  or  no  hit,  and  if  we  assume  that  the  outcomes  of  the  various  cells  are 
independent  of  each  other,  we  can  model  the  data  as  a  Bernoulli  sequence.  For  this 
problem,  we  might  be  tempted  to  call  it  a  Bernoulli  array  but  the  determination 
of  the  probabilities  will  of  course  proceed  as  usual.  If  M  cells  are  contained  in  the 
supposed  cluster  area  (shown  as  shaded  in  Figure  4.11  with  M  =  145),  then  the 
probability  of  k  hits  is  given  by  the  binomial  law 

\[ P[k] = \binom{M}{k} p^k (1-p)^{M-k}. \]

[Figure 4.11: Geographic area with incidents shown as black squares - possible cluster present.]

Next we must assign values to p under the hypothesis of a cluster present and no
cluster present. From Figure 4.10 in which we did not suspect a cluster, the relative
frequency of hits was about 0.0116 so that we assume $p_{nc} = 0.01$ when there is
no cluster. When we believe a cluster is present, we assume that $p_c = 0.1$ in
accordance with the relative frequency of hits in the shaded area of Figure 4.11,
which is 11/145 ≈ 0.076. Thus,


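\[
\begin{aligned}
P[A | B^c] &= P[\text{observed data} | \text{no cluster}] = \binom{M}{k} p_{nc}^k (1 - p_{nc})^{M-k} \\
           &= P[k = 11 | \text{no cluster}] = \binom{145}{11} (0.01)^{11} (0.99)^{134} \\
P[A | B]   &= P[\text{observed data} | \text{cluster}] = \binom{M}{k} p_c^k (1 - p_c)^{M-k} \\
           &= P[k = 11 | \text{cluster}] = \binom{145}{11} (0.1)^{11} (0.9)^{134}
\end{aligned}
\]

which results in an odds ratio of

\[ \text{odds} = \frac{(0.01)^{11} (0.99)^{134} (1 - 10^{-6})}{(0.1)^{11} (0.9)^{134} (10^{-6})} = 3.52. \]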

Since the posterior probability of no cluster is 3.52 times larger than the posterior
probability of a cluster, we would reject the hypothesis of a cluster present. However,
the odds against a cluster being present are not overwhelming. In fact, the computer
simulation used to generate Figure 4.11 employed p = 0.01 for the unshaded region
and p = 0.1 for the shaded cluster region. The reader should be aware that it is
mainly the influence of the small prior probability of a cluster, $P[B] = 10^{-6}$, that
has resulted in the greater than unity odds ratio and a decision to reject the cluster
present hypothesis.
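
The odds ratio and its sensitivity to the prior can be examined with a short calculation. The sketch below assumes the binomial model and the values used in this section (k = 11 hits in M = 145 cells, 0.01 and 0.1 as the hit probabilities); the function names are hypothetical.

```python
from math import comb


def binomial_pmf(k, M, p):
    """Binomial probability of k hits in M independent cells."""
    return comb(M, k) * p**k * (1 - p) ** (M - k)


def odds_against_cluster(k, M, p_no_cluster, p_cluster, prior_cluster):
    """Posterior odds P[no cluster | data] / P[cluster | data] via Bayes' theorem."""
    likelihood_nc = binomial_pmf(k, M, p_no_cluster)  # P[A | B^c]
    likelihood_c = binomial_pmf(k, M, p_cluster)      # P[A | B]
    return likelihood_nc * (1 - prior_cluster) / (likelihood_c * prior_cluster)


odds = odds_against_cluster(11, 145, 0.01, 0.1, 1e-6)
print(round(odds, 2))  # 3.52

# A less skeptical prior, chosen here only for illustration, reverses the decision.
odds_larger_prior = odds_against_cluster(11, 145, 0.01, 0.1, 1e-5)
print(odds_larger_prior < 1)  # True
```

Because the binomial coefficient is common to both likelihoods it cancels in the ratio, so the result depends only on the two hit probabilities and the prior.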

References 

S. Press, Subjective and Objective Bayesian Statistics, John Wiley & Sons, New
York, 2003.

D. Salsburg, The Lady Tasting Tea: How Statistics Revolutionized Science in the
Twentieth Century, W.H. Freeman, New York, 2001.


Problems 

4.1 (f) If B ⊂ A, what is P[A|B]? Explain your answer.

4.2  (^)  (f)  A  point  x  is  chosen  at  random  within  the  interval  (0, 1).  If  it  is  known 

that  x  >  1/2,  what  is  the  probability  that  x  >  7/8? 

4.3  (w)  A  coin  is  tossed  three  times  with  each  3-tuple  outcome  being  equally  likely. 

Find the probability of obtaining (H, T, H) if it is known that the outcome
has  2  heads.  Do  this  by  1)  using  the  idea  of  a  reduced  sample  space  and  2) 
using  the  definition  of  conditional  probability. 

4.4  (w)  Two  dice  are  tossed.  Each  2-tuple  outcome  is  equally  likely.  Find  the 

probability  that  the  number  that  comes  up  on  die  1  is  the  same  as  the  number 
that  comes  up  on  die  2  if  it  is  known  that  the  sum  of  these  numbers  is  even. 

4.5  (^)  (f)  An  urn  contains  3  red  balls  and  2  black  balls.  If  two  balls  are  chosen 

without  replacement,  find  the  probability  that  the  second  ball  is  black  if  it  is 
known  that  the  first  ball  chosen  is  black. 

4.6  (f)  A  coin  is  tossed  11  times  in  succession.  Each  11-tuple  outcome  is  equally 

likely  to  occur.  If  the  first  10  tosses  produced  all  heads,  what  is  the  probability 
that  the  11th  toss  will  also  be  a  head? 

4.7  (^)  (w)  Using  Table  4.1,  determine  the  probability  that  a  college  student  will 

have  a  weight  greater  than  190  lbs.  if  he/she  has  a  height  exceeding  5' 8”.  Next, 
find  the  probability  that  a  student’s  weight  will  exceed  190  lbs. 




4.8  (w)  Using  Table  4.1,  find  the  probability  that  a  student  has  weight  less  than 

160 lbs. if he/she has height greater than 5'4". Also, find the probability that
a student's weight is less than 160 lbs. if he/she has height less than 5'4". Are
these  two  results  related? 

4.9 (t) Show that the statement $P[A|B] + P[A|B^c] = 1$ is false. Use Figure 4.2a to

provide  a  counterexample. 

4.10 (t) Prove that for the events A, B, C, which are not necessarily mutually exclusive,

\[ P[A \cup B | C] = P[A | C] + P[B | C] - P[AB | C]. \]

4.11  (^)  (w)  A  group  of  20  patients  afflicted  with  a  disease  agree  to  be  part  of  a 
clinical  drug  trial.  The  group  is  divided  up  into  two  groups  of  10  subjects  each, 
with  one  group  given  the  drug  and  the  other  group  given  sugar  water,  i.e.,  this 
is  the  control  group.  The  drug  is  80%  effective  in  curing  the  disease.  If  one 
is  not  given  the  drug,  there  is  still  a  20%  chance  of  a  cure  due  to  remission. 
What  is  the  probability  that  a  randomly  selected  subject  will  be  cured? 

4.12  (w)  A  new  bus  runs  on  Sunday,  Tuesday,  Thursday,  and  Saturday  while  an 
older  bus  runs  on  the  other  days.  The  new  bus  has  a  probability  of  being  on 
time  of  2/3  while  the  older  bus  has  a  probability  of  only  1/3.  If  a  passenger 
chooses  an  arbitrary  day  of  the  week  to  ride  the  bus,  what  is  the  probability 
that  the  bus  will  be  on  time? 

4.13 (w) A digital communication system transmits one of the three values -1, 0, 1.
A channel adds noise to cause the decoder to sometimes make an error. The
error rates are 12.5% if a -1 is transmitted, 75% if a 0 is transmitted, and
12.5% if a 1 is transmitted. If the probabilities for the various symbols being
transmitted are P[-1] = P[1] = 1/4 and P[0] = 1/2, find the probability of
error. Repeat the problem if P[-1] = P[0] = P[1] and explain your results.

4.14 (^) (w) A sample space is given by S = {(x,y) : 0 < x < 1, 0 < y < 1}.
Determine P[A|B] for the events

A = {(x,y) : y < 2x, 0 < x < 1/2 and y < 2 - 2x, 1/2 < x < 1}

B = {(x,y) : 1/2 < x < 1, 0 < y < 1}.

Are  A  and  B  independent? 

4.15 (w) A sample space is given by S = {(x,y) : 0 < x < 1, 0 < y < 1}. Are the
events

A = {(x,y) : y < x}

B = {(x,y) : y < 1 - x}

independent?  Repeat  if  B  =  {(x,y)  :  x  <  1/4}. 

4.16 (t) Give an example of two events that are mutually exclusive but not independent.
Hint: See Figure 4.4.

4.17  (t)  Consider  the  sample  space  S  =  {(x,y,z)  :  0  <  x  <  1,0  <  y  <  1,0  <  z  < 
1},  which  is  the  unit  cube.  Can  you  find  three  events  that  are  independent? 
Hint:  See  Figure  4.2c. 

4.18 (t) Show that if (4.9) is satisfied for all possible events, then pairwise independence
follows. In this case all events are independent.

4.19  (0)  (f)  It  is  known  that  if  it  rains,  there  is  a  50%  chance  that  a  sewer  will 
overflow.  Also,  if  the  sewer  overflows,  then  there  is  a  30%  chance  that  the  road 
will  flood.  If  there  is  a  20%  chance  that  it  will  rain,  what  is  the  probability 
that  the  road  will  flood? 

4.20  (w)  Consider  the  sample  space  S  =  {1,2, 3,4}.  Each  simple  event  is  equally 
likely. If A = {1, 2}, B = {1, 3}, C = {1, 4}, are these events pairwise independent?
Are they independent?

4.21  (^)  (w)  In  Example  4.6  determine  if  the  events  are  pairwise  independent. 
Are  they  independent? 

4.22  (N^/)  (w)  An  urn  contains  4  red  balls  and  2  black  balls.  Two  balls  are  chosen 
in  succession  without  replacement.  If  it  is  known  that  the  first  ball  drawn  is 
black,  what  are  the  odds  in  favor  of  a  red  ball  being  chosen  on  the  second 
draw? 

4.23  (w)  In  Example  4.7  plot  the  probability  that  the  person  has  cancer  given  that 
the  test  results  are  positive,  i.e.,  the  posterior  probability,  as  a  function  of  the 
prior probability P[B]. How is the posterior probability that the person has
cancer  related  to  the  prior  probability? 

4.24  (w)  An  experiment  consists  of  two  subexperiments.  First  a  number  is  chosen 
at  random  from  the  interval  (0, 1).  Then,  a  second  number  is  chosen  at  random 
from  the  same  interval.  Determine  the  sample  space  S 2  for  the  overall  exper¬ 
iment.  Next  consider  the  event  A  =  {(x,y)  :  1/4  <  x  <  1/2, 1/2  <  y  <  3/4} 
and find P[A]. Relate P[A] to the probabilities defined on $S^1$ = {u : 0 < u <
1}, where $S^1$ is the sample space for each subexperiment.

4.25  (w,c)  A  fair  coin  is  tossed  10  times.  What  is  the  probability  of  a  run  of  exactly 
5  heads  in  a  row?  Do  not  count  runs  of  6  or  more  heads  in  a  row.  Now  verify 
your  solution  using  a  computer  simulation. 

4.26  (^)  (w)  A  lady  claims  that  she  can  tell  whether  a  cup  of  tea  containing 
milk  had  the  tea  poured  first  or  the  milk  poured  first.  To  test  her  claim  an 
experiment  is  set  up  whereby  at  random  the  milk  or  tea  is  added  first  to  an 
empty cup. This experiment is repeated 10 times. If she correctly identifies
which  liquid  was  poured  first  8  times  out  of  10,  how  likely  is  it  that  she  is 
guessing?  See  [Salsburg  2001]  for  a  further  discussion  of  this  famous  problem. 

4.27  (f)  The  probability  P[k]  is  given  by  the  binomial  law.  If  M  =  10,  for  what 
value of p is P[3] maximum? Explain your answer.

4.28 (o) (f) A sequence of independent subexperiments is conducted. Each subexperiment
has the outcomes "success", "failure", or "don't know". If P[success] =
1/2 and P[failure] = 1/4, what is the probability of 3 successes in 5 trials?

4.29  (c)  Verify  your  results  in  Problem  4.28  by  using  a  computer  simulation. 

4.30  (w)  A  drunk  person  wanders  aimlessly  along  a  path  by  going  forward  one  step 
with  probability  1/2  and  going  backward  one  step  with  probability  1/2.  After 
10  steps  what  is  the  probability  that  he  has  moved  2  steps  forward? 

4.31 (f) Prove that the geometric probability law (4.17) is a valid probability assignment.

4.32  (w)  For  a  sequence  of  independent  Bernoulli  trials  find  the  probability  of  the 
first failure at the kth trial for k = 1, 2, ....

4.33  (^)  (w)  For  a  sequence  of  independent  Bernoulli  trials  find  the  probability 
of the second success occurring at the kth trial.

4.34  (t)  Consider  a  sequence  of  independent  Bernoulli  trials.  If  it  is  known  that 
the  first  m  trials  resulted  in  failures,  prove  that  the  probability  of  the  first 
success  occurring  at  m  +  l  is  given  by  the  geometric  law  with  k  replaced  by 
l.  In  other  words,  the  probability  is  the  same  as  if  we  had  started  the  process 
over  again  after  the  mth  failure.  There  is  no  memory  of  the  first  m  failures. 

4.35  (f)  An  urn  contains  red,  black,  and  white  balls.  The  proportion  of  red  is  0.4, 
the  proportion  of  black  is  0.4,  and  the  proportion  of  white  is  0.2.  If  5  balls 
are  drawn  with  replacement,  what  is  the  probability  of  2  red,  2  black,  and  1 
white  in  any  order? 

4.36  (t)  We  derive  the  multinomial  coefficient  for  N  =  3.  This  will  yield  the  number 
of ways that an M-tuple can be formed using $k_1$ 1's, $k_2$ 2's and $k_3$ 3's. To do
so choose $k_1$ places in the M-tuple for the 1's. There will be $M - k_1$ positions
remaining. Of these positions choose $k_2$ places for the 2's. Fill in the remaining
$k_3 = M - k_1 - k_2$ positions using the 3's. Using this result, determine the
number of different M digit sequences with $k_1$ 1's, $k_2$ 2's, and $k_3$ 3's.

4.37  (t)  Show  that  the  multinomial  probability  law  reduces  to  the  binomial  law  for 
N  =  2. 

4.38 (^) (w,c) An urn contains 3 red balls, 3 black balls, and 3 white balls. If
6  balls  are  chosen  with  replacement,  how  many  of  each  color  is  most  likely? 
Hint:  You  will  need  a  computer  to  evaluate  the  probabilities. 

4.39  (w,c)  For  the  problem  discussed  in  Example  4.10  change  the  probability  of 
heads  for  the  weighted  coin  from  p  =  0.25  to  p  =  0.1.  Redraw  the  Markov  state 
probability  diagram.  Next,  using  a  computer  simulation  generate  a  sequence 
of  length  100.  Explain  your  results. 

4.40 (o) (f) For the Markov state diagram shown in Figure 4.8 with an initial
state probability of P[0] = 3/4, find the probability of the sequence 0, 1, 1, 0.

4.41 (f) A two-state Markov chain (see Figure 4.8) has the state transition probabilities
P[0|0] = 1/4, P[0|1] = 3/4 and the initial state probability of P[0] = 1/2.
What is the probability of the sequence 0, 1, 0, 1, 0?

4.42  (w)  A  digital  communication  system  model  is  shown  in  Figure  4.12.  It  consists 
of two sections with each one modeling a different portion of the communication
channel. What is the probability of a bit error? Compare this to the
probability of error for the single section model shown in Figure 4.3, assuming
that ε < 1/2, which is true in practice. Note that Figure 4.12 is a trellis.


Figure 4.12: Probabilistic model of a digital communication system with two sections.


4.43 (o) (f) For the trellis shown in Figure 4.9 find the probability of the event
A = {(0, 1, 0, 0), (0, 0, 0, 0)}.


Chapter  5 


Discrete  Random  Variables 


5.1  Introduction 

Having  been  introduced  to  the  basic  probabilistic  concepts  in  Chapters  3  and  4, 
we  now  begin  their  application  to  solving  problems  of  interest.  To  do  so  we  define 
the random variable. It will be seen to be a function, also called a mapping, of the
outcomes  of  a  random  experiment  to  the  set  of  real  numbers.  With  this  association 
we  are  able  to  use  the  real  number  description  to  quantify  items  of  interest.  In 
this  chapter  we  describe  the  discrete  random  variable,  which  is  one  that  takes  on 
a  finite  or  countably  infinite  number  of  values.  Later  we  will  extend  the  definition 
to  a  random  variable  that  takes  on  a  continuum  of  values,  the  continuous  random 
variable.  The  mathematics  associated  with  a  discrete  random  variable  are  inherently 
simpler  and  so  conceptualization  is  facilitated  by  first  concentrating  on  the  discrete 
problem.  The  reader  has  already  been  introduced  to  the  concept  of  a  random 
variable  in  Chapter  2  in  an  informal  way  and  hence  may  wish  to  review  the  computer 
simulation  methodology  described  therein. 

5.2  Summary 

The  random  variable,  which  is  a  mapping  from  the  sample  space  into  the  set  of 
real  numbers,  is  formally  discussed  and  illustrated  in  Section  5.3.  In  Section  5.4 
the  probability  of  a  random  variable  taking  on  its  possible  values  is  given  by  (5.2). 
Next  the  probability  mass  function  is  defined  by  (5.3).  Some  important  probability 
mass  functions  are  summarized  in  Section  5.5.  They  include  the  Bernoulli  (5.5),  the 
binomial  (5.6),  the  geometric  (5.7),  and  the  Poisson  (5.8).  The  binomial  probability 
mass function can be approximated by the Poisson as shown in Figure 5.8 if M → ∞
and p → 0, with Mp remaining constant. This motivates the use of the Poisson
probability  mass  function  for  traffic  modeling.  If  a  random  variable  is  transformed 
to  a  new  one  via  a  mapping,  then  the  new  random  variable  has  a  probability  mass 
function  given  by  (5.9).  Next  the  cumulative  distribution  function  is  introduced  and 
is given by (5.10). It can be used as an equivalent description for the probability
of  a  discrete  random  variable.  Its  properties  are  summarized  in  Section  5.8.  The 
computer  simulation  of  discrete  random  variables  is  revisited  in  Section  5.9  with  the 
estimate  of  the  probability  mass  function  and  the  cumulative  distribution  function 
given by (5.14) and (5.15), (5.16), respectively. Finally, the application of the Poisson
probability  model  to  determining  the  resources  required  to  service  customers  is 
described  in  Section  5.10. 


5.3  Definition  of  Discrete  Random  Variable 

We have previously used a coin toss and a die toss as examples of a random experiment.
In the case of a die toss the outcomes comprised the sample space
S = {1, 2, 3, 4, 5, 6}. This was because each face of a die has a dot pattern consisting
of 1, 2, 3, 4, 5, or 6 dots. A natural description of the outcome of a die toss
is therefore the number of dots observed on the face that appears upward. In effect,
we have mapped the dot pattern into the number of dots in describing the outcome.
This type of experiment is called a numerically valued random phenomenon since the
basic output is a real number. In the case of a coin toss the outcomes comprise the
nonnumerical sample space S = {head, tail}. We have, however, at times replaced
the sample space by one consisting only of real numbers such as $S_X$ = {0, 1}, where
a head is mapped into a 1 and a tail is mapped into a 0. This mapping is shown
in Figure 5.1. For many applications this is a convenient mapping. For example, in


Figure 5.1: Mapping of the outcome of a coin toss into the set of real numbers.


a  succession  of  M  coin  tosses,  we  might  be  interested  in  the  total  number  of  heads 
observed.  With  the  defined  mapping  of 


$$X(s_i) = \begin{cases} 0 & s_1 = \text{tail} \\ 1 & s_2 = \text{head} \end{cases}$$

we could represent the number of heads as $\sum_{i=1}^{M} X(s_i)$, where $s_i$ is the outcome of
the ith toss. The function that maps S into $S_X$ and which is denoted by X(·) is
called a random variable. It is a function that takes each outcome of S (which may
not necessarily be a set of numbers) and maps it into a subset of the set of real
numbers. Note that as previously mentioned in Chapter 2, a capital letter X will
denote the random variable and a lowercase letter x its value. This convention for
the coin toss example produces the assignment

$$X(s_i) = x_i \qquad i = 1, 2$$


where $s_1$ = tail and thus $x_1 = 0$, and $s_2$ = head and thus $x_2 = 1$. The name
random variable is a poor one in that the function X(·) is not random but a known
one and usually one of our own choosing. What is random is the input argument $s_i$
and  hence  the  output  of  the  function  is  random.  However,  due  to  its  long-standing 
usage  in  probability  we  will  retain  this  terminology. 

Sometimes  it  is  more  convenient  to  use  a  particular  random  variable  for  a  given 
experiment.  For  example,  in  Chapter  2  we  described  a  digital  communication  system 
called  a  PSK  system.  A  bit  is  communicated  using  the  transmitted  signals 

$$s_i(t) = \begin{cases} -A \cos 2\pi F_0 t & \text{for a 0} \\ A \cos 2\pi F_0 t & \text{for a 1.} \end{cases}$$

Usually  a  1  or  a  0  occurs  with  equal  probability  so  that  the  choice  of  a  bit  can  be 
modeled  as  the  outcome  of  a  fair  coin  tossing  experiment.  If  a  head  is  observed,  then 
a  1  is  transmitted  and  a  0  otherwise.  As  a  result,  we  could  represent  the  transmitted 
signal  with  the  model 

$$s_i(t) = X(s_i) A \cos 2\pi F_0 t$$

where $s_1$ = tail and $s_2$ = head and hence we have the defined random variable

$$X(s_i) = \begin{cases} -1 & s_1 = \text{tail} \\ +1 & s_2 = \text{head.} \end{cases}$$


This  random  variable  is  a  convenient  one  for  this  application. 

In  general,  a  random  variable  is  a  function  that  maps  the  sample  space  S  into  a 
subset of the real line. The real line will be denoted by R (R = {x : −∞ < x < ∞}).
For a discrete random variable this subset is a finite or countably infinite set of
points. The subset forms a new sample space which we will denote by $S_X$, and
which  is  illustrated  in  Figure  5.2.  A  discrete  random  variable  may  also  map  multiple 
elements  of  the  sample  space  into  the  same  number  as  illustrated  in  Figure  5.3.  An 
example  would  be  a  die  toss  experiment  in  which  we  were  only  interested  in  whether 
the  outcome  is  even  or  odd.  To  quantify  this  outcome  we  could  define 

$$X(s_i) = \begin{cases} 0 & \text{if } s_i = 1, 3, 5 \text{ dots} \\ 1 & \text{if } s_i = 2, 4, 6 \text{ dots.} \end{cases}$$

Figure 5.2: Discrete random variable as a one-to-one mapping of a countably infinite
sample  space  into  set  of  real  numbers. 


Figure  5.3:  Discrete  random  variable  as  a  many-to-one  mapping  of  a  countably 
infinite  sample  space  into  set  of  real  numbers. 


This  type  of  mapping  is  usually  called  a  many-to-one  mapping  while  the  previous 
one  is  called  a  one-to-one  mapping.  Note  that  for  a  many-to-one  mapping  we  cannot 
recover  the  outcome  of  S  if  we  know  the  value  of  X(s).  But  as  already  explained, 
this  is  of  little  concern  since  we  initially  defined  the  random  variable  to  output  the 
item of interest. Lastly, for numerically valued random experiments in which s is
contained in R, we can still use the random variable approach if we define X(s) = s
for all s. This allows the concept of a random variable to be used for all random
experiments,  with  either  numerical  or  nonnumerical  outputs. 


5.4  Probability  of  Discrete  Random  Variables 

We would next like to determine the probabilities of the random variable taking on
its possible values. In other words, what is the probability $P[X(s_i) = x_i]$ for each
$x_i$? Since the sample space S is discrete, the random variable can take on at
most a countably infinite number of values or $X(s_i) = x_i$ for i = 1, 2, .... It should
be clear that if X(·) maps each $s_i$ into a different $x_i$ (or X(·) is one-to-one), then
because $s_i$ and $x_i$ are just two different names for the same event


$$P[X(s) = x_i] = P[\{s_j : X(s_j) = x_i\}] = P[\{s_i\}] \tag{5.1}$$

or  we  assign  a  probability  to  the  value  of  the  random  variable  equal  to  that  of  the 
simple  event  in  S  that  yields  that  value.  If,  however,  there  are  multiple  outcomes 
in S that map into the same value $x_i$ (or X(·) is many-to-one) then

$$P[X(s) = x_i] = P[\{s_j : X(s_j) = x_i\}] = \sum_{\{j : X(s_j) = x_i\}} P[\{s_j\}] \tag{5.2}$$

since the $s_j$'s are simple events in S and are therefore mutually exclusive. It is said
that the events $\{X = x_i\}$, defined on $S_X$, and $\{s_j : X(s_j) = x_i\}$, defined on S, are
equivalent events. As such they are assigned the same probability. Note that the
probability  assignment  (5.2)  subsumes  that  of  (5.1)  and  that  in  either  case  we  can 
summarize  the  probabilities  that  the  random  variable  values  take  on  by  defining  the 
probability  mass  function  (PMF)  as 


$$p_X[x_i] = P[X(s) = x_i] \tag{5.3}$$

and use (5.2) to evaluate it from a knowledge of the mapping. It is important to
observe that in the notation $p_X[x_i]$ the subscript X refers to the random variable and
also the [·] notation is meant to remind the reader that the argument is a discrete
one. Later, we will use (·) for continuous arguments. In summary, the probability
mass function is the probability that the random variable X takes on the value $x_i$
for each possible $x_i$. An example follows.

Example  5.1  —  Coin  toss  -  one-to-one  mapping 

The  experiment  consists  of  a  single  coin  toss  with  a  probability  of  heads  equal  to 
p. The sample space is S = {head, tail} and we define the random variable as

$$X(s_i) = \begin{cases} 0 & s_i = \text{tail} \\ 1 & s_i = \text{head.} \end{cases}$$

The PMF is therefore from (5.3) and (5.1)

$$\begin{aligned} p_X[0] &= P[X(s) = 0] = 1 - p \\ p_X[1] &= P[X(s) = 1] = p. \end{aligned}$$

❖

Example 5.2 - Die toss - many-to-one mapping

The  experiment  consists  of  a  single  fair  die  toss.  With  a  sample  space  of  S  = 
{1, 2, 3, 4, 5, 6}  and  an  interest  only  in  whether  the  outcome  is  even  or  odd  we  define 
the  random  variable 


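$$X(s_i) = \begin{cases} 0 & \text{if } s_i = 1, 3, 5 \\ 1 & \text{if } s_i = 2, 4, 6. \end{cases}$$

Thus, using (5.3) and (5.2) we have the PMF

$$\begin{aligned} p_X[0] &= P[X(s) = 0] = \sum_{j=1,3,5} P[\{s_j\}] = \frac{3}{6} \\ p_X[1] &= P[X(s) = 1] = \sum_{j=2,4,6} P[\{s_j\}] = \frac{3}{6}. \end{aligned}$$

❖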

The  use  of  (5.2)  may  seem  familiar  and  indeed  it  should.  We  have  summed  the 
probabilities  of  simple  events  in  S  to  obtain  the  probability  of  an  event  in  S  using 
(3.10). Here, the event is just the subset of S for which $X(s) = x_i$ holds. The
introduction  of  a  random  variable  has  quantified  the  events  of  interest! 
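
The summation in (5.2) can be sketched in MATLAB for the die toss of Example 5.2. The variable names (Ps, Xmap, xvals, pX) are chosen here for illustration only and do not appear in the text; the die is assumed to be fair.

```matlab
% Sketch of (5.2): p_X[x_i] is the sum of P[{s_j}] over all s_j with X(s_j) = x_i.
% Assumed setup: fair die, X = 0 for odd outcomes and X = 1 for even outcomes.
Ps    = ones(1,6)/6;        % P[{s_j}] for the outcomes s_j = 1,...,6
Xmap  = [0 1 0 1 0 1];      % X(s_j): 0 if s_j is odd, 1 if s_j is even
xvals = unique(Xmap);       % values in S_X, here [0 1]
pX    = zeros(size(xvals));
for i = 1:length(xvals)
    pX(i) = sum(Ps(Xmap == xvals(i)));   % add the simple-event probabilities
end
disp([xvals; pX])           % first row: values of X, second row: PMF values
```

Changing Xmap to a one-to-one assignment reduces the inner sum to a single term, which is the special case (5.1).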

Finally, because PMFs $p_X[x_i]$ are just new names for the probabilities $P[X(s) = x_i]$
they must satisfy the usual properties:

Property  5.1  —  Range  of  values 


$$0 \le p_X[x_i] \le 1$$


□ 


Property  5.2  —  Sum  of  values 

$$\sum_{i=1}^{M} p_X[x_i] = 1 \quad \text{if } S_X \text{ consists of } M \text{ outcomes}$$

$$\sum_{i=1}^{\infty} p_X[x_i] = 1 \quad \text{if } S_X \text{ is countably infinite.}$$

□

We will frequently omit the s argument of X to write $p_X[x_i] = P[X = x_i]$.

Once  the  PMF  has  been  specified  all  subsequent  probability  calculations  can  be 
based  on  it,  without  referring  back  to  the  original  sample  space  S.  Specifically,  for 
an event A defined on $S_X$ the probability is given by

$$P[X \in A] = \sum_{\{i : x_i \in A\}} p_X[x_i]. \tag{5.4}$$

An example follows.

Example  5.3  -  Calculating  probabilities  based  on  the  PMF 

Consider  a  die  whose  sides  have  been  labeled  with  two  sides  having  1  dot,  two 
sides having 2 dots, and two sides having 3 dots. Hence, S = {$s_1, s_2, \ldots, s_6$} =
{side 1, side 2, side 3, side 4, side 5, side 6}. Then if we are interested in the probabilities
of the outcomes displaying either 1, 2, or 3 dots, we would define a random
variable as

$$X(s_i) = \begin{cases} 1 & i = 1, 2 \\ 2 & i = 3, 4 \\ 3 & i = 5, 6. \end{cases}$$

It easily follows then that the PMF is from (5.2)

$$p_X[1] = p_X[2] = p_X[3] = \frac{1}{3}.$$

Now assume we are interested in the probability that a 2 or 3 occurs or A = {2, 3}.
Then from (5.4) we have

$$P[X \in \{2, 3\}] = p_X[2] + p_X[3] = \frac{2}{3}.$$

There is no need to reconsider the original sample space S and all probability calculations
of interest are obtainable from the PMF.

❖


5.5  Important  Probability  Mass  Functions 

We  have  already  encountered  many  of  these  in  Chapter  4.  We  now  summarize  these 
in our new notation. Since the sample spaces $S_X$ consist of integer values we will
replace the notation $x_i$ by k, which indicates an integer.

5.5.1  Bernoulli 


$$p_X[k] = \begin{cases} 1 - p & k = 0 \\ p & k = 1. \end{cases} \tag{5.5}$$

The  PMF  is  shown  in  Figure  5.4  and  is  recognized  as  a  sequence  of  numbers  that  is 
nonzero  only  for  the  indices  k  =  0, 1.  It  is  convenient  to  represent  the  Bernoulli  PMF 
using  the  shorthand  notation  Ber(p).  With  this  notation  we  replace  the  description 
that “X is distributed according to a Bernoulli random variable with PMF Ber(p)”
by the shorthand notation X ~ Ber(p), where ~ means “is distributed according
to”.

Figure 5.4: Bernoulli probability mass function for p = 0.25.

5.5.2  Binomial 

$$p_X[k] = \binom{M}{k} p^k (1-p)^{M-k} \qquad k = 0, 1, \ldots, M. \tag{5.6}$$

The PMF is shown in Figure 5.5. The shorthand notation for the binomial PMF is
bin(M, p). The location of the maximum of the PMF can be shown to be given by
⌊(M + 1)p⌋, where ⌊x⌋ denotes the largest integer less than or equal to x (see Problem
5.7).

Figure 5.5: Binomial probability mass function for M = 10, p = 0.25.
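
As a quick numerical illustration of (5.6) and of the stated location of the maximum, the following sketch evaluates the binomial PMF for the parameters of Figure 5.5 using only base MATLAB functions. The variable names are assumptions made for this sketch.

```matlab
% Binomial PMF (5.6) for M = 10, p = 0.25 and the location of its maximum.
M = 10;  p = 0.25;                 % values used in Figure 5.5
k = 0:M;
pX = zeros(1,M+1);
for n = 1:M+1
    pX(n) = nchoosek(M,k(n)) * p^(k(n)) * (1-p)^(M-k(n));
end
[pmax, idx] = max(pX);             % largest PMF value and its index
kmax = k(idx);                     % value of k at which the PMF peaks
fprintf('sum = %.4f, max at k = %d, floor((M+1)p) = %d\n', ...
        sum(pX), kmax, floor((M+1)*p));
```

For these parameters (M + 1)p = 2.75, so the PMF is expected to peak at k = 2.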

5.5.3 Geometric

$$p_X[k] = (1-p)^{k-1} p \qquad k = 1, 2, \ldots \tag{5.7}$$

The PMF is shown in Figure 5.6. The shorthand notation for the geometric PMF
is geom(p).


Figure 5.6: Geometric probability mass function for p = 0.25.


5.5.4  Poisson 


$$p_X[k] = \exp(-\lambda) \frac{\lambda^k}{k!} \qquad k = 0, 1, 2, \ldots \tag{5.8}$$

where λ > 0. The PMF is shown in Figure 5.7 for several values of λ. Note that
the maximum occurs at ⌊λ⌋ (see Problem 5.11). The shorthand notation is Pois(λ).


5.6  Approximation  of  Binomial  PMF  by  Poisson  PMF 

The binomial and Poisson PMFs are related to each other under certain conditions.
This relationship helps to explain why the Poisson PMF is used in various
applications, primarily traffic modeling as described further in Section 5.10. The
relationship is as follows. If in a binomial PMF, we let M → ∞ as p → 0 such that the
product λ = Mp remains constant, then bin(M, p) → Pois(λ). Note that λ = Mp
represents the expected or average number of successes in M Bernoulli trials (see
Chapter 6 for definition of expectation). Hence, by keeping the average number of
successes fixed but assuming more and more trials with smaller and smaller probabilities
of success on each trial, we are led to a Poisson PMF. As an example, a
comparison is shown in Figure 5.8 for M = 10, p = 0.5 and M = 100, p = 0.05. This


Figure 5.7: The Poisson probability mass function for different values of λ. Panels: (a) λ = 2, (b) λ = 5.


Figure 5.8: The Poisson approximation to the binomial probability mass function. Panels: (a) M = 10, p = 0.5, (b) M = 100, p = 0.05.


result  is  primarily  useful  since  Poisson  PMFs  are  easier  to  manipulate  and  also  arise 
in  the  modeling  of  point  processes  as  described  in  Chapter  21. 

To make this connection we have for the binomial PMF with p = λ/M → 0 as
M → ∞ (and λ fixed)

$$\begin{aligned}
p_X[k] &= \binom{M}{k} p^k (1-p)^{M-k} \\
&= \frac{M!}{(M-k)!\,k!} \left(\frac{\lambda}{M}\right)^k \left(1 - \frac{\lambda}{M}\right)^{M-k} \\
&= \frac{(M)_k}{k!} \frac{\lambda^k}{M^k} \frac{(1 - \lambda/M)^M}{(1 - \lambda/M)^k} \\
&= \frac{\lambda^k}{k!} \frac{(M)_k}{M^k} \frac{(1 - \lambda/M)^M}{(1 - \lambda/M)^k}
\end{aligned}$$

where $(M)_k = M!/(M-k)! = M(M-1)\cdots(M-k+1)$. But for a fixed k, as M → ∞, we
have that $(M)_k/M^k \to 1$. Also, for a fixed k, $(1 - \lambda/M)^k \to 1$ so that we need only
find the limit of $g(M) = (1 - \lambda/M)^M$ as M → ∞. This is shown in Problem 5.15 to be
exp(−λ) and therefore

$$p_X[k] \to \frac{\lambda^k}{k!} \exp(-\lambda).$$

Also, since the binomial PMF is defined for k = 0, 1, ..., M, as M → ∞ the limiting
PMF is defined for k = 0, 1, .... This result can also be found using characteristic
functions as shown in Chapter 6.
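
The quality of the approximation can also be checked numerically. The sketch below compares bin(M, p) with Pois(λ), λ = Mp = 5, for the two cases M = 10, p = 0.5 and M = 100, p = 0.05. Computing the PMFs through logarithms with gammaln is a choice made here to avoid large factorials; it is not taken from the text, and the variable names are illustrative.

```matlab
% Compare bin(M,p) with Pois(lambda), lambda = M*p = 5, for M = 10 and M = 100.
lambda = 5;
Mvals  = [10 100];
maxerr = zeros(size(Mvals));
for n = 1:length(Mvals)
    M = Mvals(n);  p = lambda/M;
    k = 0:M;
    % log of the binomial PMF (5.6); note that log(k!) = gammaln(k+1)
    logbin  = gammaln(M+1) - gammaln(k+1) - gammaln(M-k+1) ...
              + k*log(p) + (M-k)*log(1-p);
    % log of the Poisson PMF (5.8) on the same values of k
    logpois = -lambda + k*log(lambda) - gammaln(k+1);
    maxerr(n) = max(abs(exp(logbin) - exp(logpois)));
end
disp(maxerr)    % largest absolute difference between the two PMFs for each M
```

The second entry of maxerr should be noticeably smaller than the first, in line with the limiting argument above.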


5.7  Transformation  of  Discrete  Random  Variables 

It  is  frequently  of  interest  to  be  able  to  determine  the  PMF  of  a  transformed  random 
variable. Mathematically, we desire the PMF of the new random variable Y = g(X),
where  X  is  a  discrete  random  variable.  For  example,  consider  a  die  whose  faces  are 
labeled  with  the  numbers  0,0, 1,1, 2, 2.  We  wish  to  find  the  PMF  of  the  number 
observed  when  the  die  is  tossed,  assuming  all  sides  are  equally  likely  to  occur.  If 
the  original  sample  space  is  composed  of  the  possible  cube  sides  that  can  occur,  so 
that $S_X$ = {1, 2, 3, 4, 5, 6}, then the transformation appears as shown in Figure 5.9.
Specifically, we have that

$$y = \begin{cases} y_1 = 0 & \text{if } x = x_1 = 1 \text{ or } x = x_2 = 2 \\ y_2 = 1 & \text{if } x = x_3 = 3 \text{ or } x = x_4 = 4 \\ y_3 = 2 & \text{if } x = x_5 = 5 \text{ or } x = x_6 = 6. \end{cases}$$

Note that the transformation is many-to-one. Since events such as $\{y : y = y_1 = 0\}$
and $\{x : x = x_1 = 1, x = x_2 = 2\}$, for example, are equivalent, they should be
assigned the same probability. Thus, using the property that the events $\{X = x_i\}$
are simple events defined on $S_X$ we have that

$$p_Y[y_i] = \begin{cases} p_X[1] + p_X[2] = \frac{1}{3} & i = 1 \\ p_X[3] + p_X[4] = \frac{1}{3} & i = 2 \\ p_X[5] + p_X[6] = \frac{1}{3} & i = 3. \end{cases}$$


In  general,  we  have  that 


$$p_Y[y_i] = \sum_{\{j : g(x_j) = y_i\}} p_X[x_j]. \tag{5.9}$$

We just sum up the probabilities for all the values of X = $x_j$ that are mapped
into Y = $y_i$. This is reminiscent of (5.2) in which the transformation was from
the objects $s_j$ defined on S to the numbers $x_i$ defined on $S_X$. In fact, it is nearly
identical except that we have replaced the objects that are to be transformed by
numbers, i.e., the $x_j$'s. Some examples of this procedure follow.

Example  5.4  —  One-to-one  transformation  of  Bernoulli  random  variable 

If X ~ Ber(p) and Y = 2X − 1, determine the PMF of Y. The sample space for X
is $S_X$ = {0, 1} and consequently that for Y is $S_Y$ = {−1, 1}. It follows that $x_1 = 0$
maps into $y_1 = -1$ and $x_2 = 1$ maps into $y_2 = 1$. As a result, we have from (5.9)

$$\begin{aligned} p_Y[-1] &= p_X[0] = 1 - p \\ p_Y[1] &= p_X[1] = p. \end{aligned}$$

Note that this mapping is particularly simple since it is one-to-one. A slightly more
complicated example is next.

❖


Example  5.5  —  Many-to-one  transformation 

Let the transformation be $Y = g(X) = X^2$ which is defined on the sample space
$S_X$ = {−1, 0, 1} so that $S_Y$ = {0, 1}. Clearly, $g(x_j) = x_j^2 = 0$ only for $x_j = 0$.
Hence,

$$p_Y[0] = p_X[0].$$

However, $g(x_j) = x_j^2 = 1$ for $x_j = -1$ and $x_j = 1$. Thus, using (5.9) we have

$$p_Y[1] = \sum_{\{x_j : x_j^2 = 1\}} p_X[x_j] = p_X[-1] + p_X[1].$$

Note that we have determined $p_Y[y_i]$ by summing the probabilities of all the $x_j$'s
that map into $y_i$ via the transformation y = g(x). This is in essence the meaning of
(5.9).

❖ 
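
The many-to-one transformation of Example 5.5 can be sketched in MATLAB as a direct implementation of (5.9). The example does not specify the PMF of X, so the values in pX below are assumed purely for illustration, as are the variable names.

```matlab
% Sketch of (5.9): p_Y[y_i] is the sum of p_X[x_j] over all x_j with g(x_j) = y_i.
xvals = [-1 0 1];          % sample space S_X of Example 5.5
pX    = [0.3 0.5 0.2];     % assumed PMF of X (illustrative values only)
g     = @(x) x.^2;         % transformation Y = g(X) = X^2
yall  = g(xvals);          % image of each x_j under g
yvals = unique(yall);      % sample space S_Y, here [0 1]
pY    = zeros(size(yvals));
for i = 1:length(yvals)
    pY(i) = sum(pX(yall == yvals(i)));   % collect all x_j mapped into y_i
end
disp([yvals; pY])          % p_Y[0] = p_X[0], p_Y[1] = p_X[-1] + p_X[1]
```

Replacing g by a one-to-one function such as @(x) 2*x-1 leaves every sum with a single term, as in Example 5.4.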


Example  5.6  —  Many-to-one  transformation  of  Poisson  random  variable 

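Now consider X ~ Pois(λ) and define the transformation Y = g(X) as

$$Y = \begin{cases} 1 & \text{if } X = k \text{ is even} \\ -1 & \text{if } X = k \text{ is odd.} \end{cases}$$

To find the PMF for Y we use

$$p_Y[k] = P[Y = k] = \begin{cases} P[X \text{ is even}] & k = 1 \\ P[X \text{ is odd}] & k = -1. \end{cases}$$

We need only determine $p_Y[1]$ since $p_Y[-1] = 1 - p_Y[1]$. Thus, from (5.9)

$$p_Y[1] = \sum_{j=0 \text{ and even}}^{\infty} p_X[j] = \sum_{j=0 \text{ and even}}^{\infty} \exp(-\lambda) \frac{\lambda^j}{j!}.$$

To evaluate the infinite sum in closed form we use the following “trick”

$$\sum_{j=0 \text{ and even}}^{\infty} \frac{\lambda^j}{j!} = \frac{1}{2} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} + \frac{1}{2} \sum_{j=0}^{\infty} \frac{(-\lambda)^j}{j!} = \frac{1}{2} \exp(\lambda) + \frac{1}{2} \exp(-\lambda)$$

since the Taylor expansion of exp(x) is known to be $\sum_{j=0}^{\infty} x^j/j!$ (see Problem 5.22).
Finally, we have that

$$\begin{aligned} p_Y[1] &= \exp(-\lambda) \left[ \frac{1}{2} \exp(\lambda) + \frac{1}{2} \exp(-\lambda) \right] = \frac{1}{2} (1 + \exp(-2\lambda)) \\ p_Y[-1] &= 1 - p_Y[1] = \frac{1}{2} (1 - \exp(-2\lambda)). \end{aligned}$$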
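
The closed-form result for P[X is even] can be checked against a direct, truncated evaluation of the infinite sum. In this sketch the value λ = 2 and the truncation at j = 100 are assumptions made for illustration; the omitted terms are negligible for this λ.

```matlab
% Numerical check of P[X is even] = (1 + exp(-2*lambda))/2 for X ~ Pois(lambda).
lambda = 2;                      % assumed value for illustration
j      = 0:2:100;                % even indices only; the tail beyond 100 is negligible
terms  = exp(-lambda + j*log(lambda) - gammaln(j+1));   % p_X[j] for even j
pY1_sum    = sum(terms);         % truncated version of the infinite sum
pY1_closed = 0.5*(1 + exp(-2*lambda));                  % closed-form result
fprintf('sum = %.10f, closed form = %.10f\n', pY1_sum, pY1_closed);
```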


5.8  Cumulative  Distribution  Function 

An  alternative  means  of  summarizing  the  probabilities  of  a  discrete  random  variable 
is  the  cumulative  distribution  function  (CDF).  It  is  sometimes  referred  to  more 
succinctly as the distribution function. The CDF for a random variable X and
evaluated at x is given by P[{real numbers x' : x' ≤ x}], which is the probability
that X lies in the semi-infinite interval (−∞, x]. It is therefore defined as

$$F_X(x) = P[X \le x] \qquad -\infty < x < \infty. \tag{5.10}$$

It  is  important  to  observe  that  the  value  X  =  x  is  included  in  the  interval.  As  an 
example,  if  X  ~  Ber(p),  then  the  PMF  and  the  corresponding  CDF  are  shown  in 
Figure 5.10. Because the random variable takes on only the values 0 and 1, the CDF
changes its value only at these points, where it jumps. The CDF can be thought of
as a “running sum” which adds up the probabilities of the PMF starting at −∞ and
ending at +∞. When the value x of $F_X(x)$ encounters a nonzero value of the PMF,
the additional mass causes the CDF to jump, with the size of the jump equal to the
value of the PMF at that point. For example, referring to Figure 5.10b, at x = 0 we
have $F_X(0) = p_X[0] = 1 - p = 3/4$ and at x = 1 we have $F_X(1) = p_X[0] + p_X[1] = 1$,
with the jump having size $p_X[1] = p = 1/4$. Another example follows.

Figure 5.10: The Bernoulli probability mass function and cumulative distribution
function for p = 0.25. Panels: (a) PMF, (b) CDF.

Example  5.7  —  CDF  for  geometric  random  variable 

Since $p_X[k] = (1-p)^{k-1} p$ for k = 1, 2, ..., we have the CDF

$$F_X(x) = \begin{cases} 0 & x < 1 \\ \sum_{i=1}^{\lfloor x \rfloor} (1-p)^{i-1} p & x \ge 1 \end{cases}$$

where ⌊x⌋ denotes the largest integer less than or equal to x. This evaluates to

$$F_X(x) = \begin{cases} 0 & x < 1 \\ p & 1 \le x < 2 \\ p + (1-p)p & 2 \le x < 3 \\ \text{etc.} \end{cases}$$


Figure 5.11: The geometric probability mass function and cumulative distribution
function for p = 0.5. Panels: (a) PMF, (b) CDF.

The PMF and CDF are plotted in Figure 5.11 for p = 0.5. Since the CDF jumps at
each nonzero value of the PMF and the jump size is that value of the PMF, we can
recover the PMF from the CDF. In particular, we have that

$$p_X[x] = F_X(x^+) - F_X(x^-)$$

where $x^+$ denotes a value just slightly larger than x and $x^-$ denotes a value just
slightly smaller than x. Thus, if $F_X(x)$ does not have a discontinuity at x the
value of the PMF is zero. At a discontinuity the value of the PMF is just the
jump size as previously asserted. Also, because of the definition of the CDF, i.e.,
that $F_X(x) = P[X \le x] = P[X < x \text{ or } X = x]$, the value of $F_X(x)$ is the value
after the jump. The CDF is said to be right-continuous which is sometimes stated
mathematically as $\lim_{x \to x_0^+} F_X(x) = F_X(x_0)$ at the point $x = x_0$.

❖ 
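
The running-sum view of the CDF and the recovery of the PMF from the jump sizes can be sketched in MATLAB as follows. The truncation point K and the variable names are assumptions of this sketch. The closed form 1 − (1 − p)^k used for comparison is the standard sum of the geometric series at integer k; it is included here only as a check.

```matlab
% Geometric PMF (5.7), its CDF as a running sum, and the PMF recovered from the jumps.
p  = 0.5;                        % value used in Figure 5.11
K  = 20;                         % truncation point chosen for illustration
k  = 1:K;
pX = (1-p).^(k-1) * p;           % PMF for k = 1,...,K
FX = cumsum(pX);                 % F_X(k) = p_X[1] + ... + p_X[k]
FXclosed = 1 - (1-p).^k;         % closed form of the geometric sum at integer k
jumps = diff([0 FX]);            % F_X(k+) - F_X(k-) at each integer k
fprintf('max |FX - closed form| = %g\n', max(abs(FX - FXclosed)));
fprintf('max |jump - PMF|       = %g\n', max(abs(jumps - pX)));
```

Both printed differences should be at the level of floating-point round-off.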

From the previous example we see that the PMF and CDF are equivalent descriptions
of the probability assignment for X. Either one can be used to find the
probability of X being in an interval (even an interval of length zero). For example,
to determine P[3/2 < X ≤ 7/2] for the geometric random variable

$$P\left[\frac{3}{2} < X \le \frac{7}{2}\right] = p_X[2] + p_X[3]$$

as is evident by referring to Figure 5.11b. We need to be careful, however, to
note whether the endpoints of the interval are included or not. This is due to the
discontinuities of the CDF. Because of the definition of the CDF as the probability
of X being within the interval (−∞, x], which includes the right-most point, we have
for the interval (a, b]

$$P[a < X \le b] = F_X(b^+) - F_X(a^+). \tag{5.11}$$

Also, the other intervals (a, b), [a, b), and [a, b] will in general have different probabilities
than that given by (5.11). From Figure 5.11b and (5.11) we have as an
example that

$$P[2 < X \le 3] = F_X(3^+) - F_X(2^+) = p_X[3] = (1-p)^2 p = 0.125$$

but

$$P[2 \le X \le 3] = F_X(3^+) - F_X(2^-) = (1-p)p + (1-p)^2 p = 0.375.$$
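
The effect of including or excluding the left endpoint can be reproduced with a few lines of MATLAB. The sketch assumes the geometric random variable with p = 0.5 and uses the closed-form CDF 1 − (1 − p)^⌊x⌋ for x ≥ 1, which is the summed geometric series and is an addition made here for convenience; the function handles FX and pX are illustrative names.

```matlab
% Interval probabilities for the geometric random variable with p = 0.5.
p  = 0.5;
FX = @(x) 1 - (1-p).^floor(max(x,0));   % CDF for real x; equals 0 for x < 1
pX = @(k) (1-p).^(k-1) * p;             % PMF for k = 1,2,...
a = 2;  b = 3;
P_open_left   = FX(b) - FX(a);          % P[a <  X <= b], as in (5.11)
P_closed_left = FX(b) - FX(a) + pX(a);  % P[a <= X <= b]: add the mass at x = a
fprintf('P[2 < X <= 3] = %.3f, P[2 <= X <= 3] = %.3f\n', ...
        P_open_left, P_closed_left);
```

Because FX returns the value after the jump, the probability mass at the left endpoint has to be added back explicitly when the interval is closed on the left.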

From  the  definition  of  the  CDF  and  as  further  illustrated  in  Figures  5.10  and 
5.11  the  CDF  has  several  important  properties.  They  are  now  listed  and  proven. 

Property  5.3  —  CDF  is  between  0  and  1. 

$$0 \le F_X(x) \le 1 \qquad -\infty < x < \infty$$

Proof: Since by definition $F_X(x) = P[X \le x]$ is a probability for all x, it must lie
between  0  and  1. 

□ 


Property 5.4 - Limits of CDF as x → −∞ and as x → ∞

$$\lim_{x \to -\infty} F_X(x) = 0$$

$$\lim_{x \to +\infty} F_X(x) = 1.$$

Proof:

$$\lim_{x \to -\infty} F_X(x) = P[\{s : X(s) < -\infty\}] = P[\emptyset] = 0$$

since the values that X(s) can take on do not include −∞. Also,

$$\lim_{x \to +\infty} F_X(x) = P[\{s : X(s) < +\infty\}] = P[S] = 1$$

since the values that X(s) can take on are all included on the real line.


□ 


Property  5.5  —  CDF  is  monotonically  increasing. 

A monotonically increasing function g(·) is one in which for every $x_1$ and $x_2$ with
$x_1 \le x_2$, it follows that $g(x_1) \le g(x_2)$ or the function increases or stays the same as
the argument increases (see also Problem 5.29).

Proof:

$$\begin{aligned} F_X(x_2) &= P[X \le x_2] && \text{(definition)} \\ &= P[(X \le x_1) \cup (x_1 < X \le x_2)] \\ &= P[X \le x_1] + P[x_1 < X \le x_2] && \text{(Axiom 3)} \\ &= F_X(x_1) + P[x_1 < X \le x_2] \ge F_X(x_1). && \text{(definition and Axiom 1)} \end{aligned}$$

Alternatively, if $A = \{-\infty < X \le x_1\}$ and $B = \{-\infty < X \le x_2\}$ with $x_1 \le x_2$,
then $A \subset B$. From Property 3.5 (monotonicity) $F_X(x_2) = P[B] \ge P[A] = F_X(x_1)$.

□ 


Property  5.6  —  CDF  is  right-continuous. 

By right-continuous it is meant that as we approach the point $x_0$ from the right,
the limiting value of the CDF should be the value of the CDF at that point.
Mathematically, it is expressed as

$$\lim_{x \to x_0^+} F_X(x) = F_X(x_0).$$


Proof: 

The  proof  relies  on  the  continuity  property  of  the  probability  function.  It  can  be 
found  in  [Ross  2002]. 

□ 


Property  5.7  —  Probability  of  interval  found  using  the  CDF 

$$P[a < X \le b] = F_X(b) - F_X(a) \tag{5.12}$$

or more explicitly to remind us of possible discontinuities

$$P[a < X \le b] = F_X(b^+) - F_X(a^+). \tag{5.13}$$

Proof:

Since for a < b

$$\{-\infty < X \le b\} = \{-\infty < X \le a\} \cup \{a < X \le b\}$$

and the intervals on the right-hand-side are disjoint (mutually exclusive events), by
Axiom 3

$$P[-\infty < X \le b] = P[-\infty < X \le a] + P[a < X \le b]$$

or rearranging terms we have that

$$P[a < X \le b] = P[-\infty < X \le b] - P[-\infty < X \le a] = F_X(b) - F_X(a).$$


5.9  Computer  Simulation 

In  Chapter  2  we  discussed  how  to  simulate  a  discrete  random  variable  on  a  digital 
computer.  In  particular,  Section  2.4  presented  some  MATLAB  code.  We  now 
continue  that  discussion  to  show  how  to  simulate  a  discrete  random  variable  and 
estimate its PMF and CDF. Assume that X can take on values in $S_X$ = {1, 2, 3}
with a PMF

$$p_X[x] = \begin{cases} p_1 = 0.2 & \text{if } x = x_1 = 1 \\ p_2 = 0.6 & \text{if } x = x_2 = 2 \\ p_3 = 0.2 & \text{if } x = x_3 = 3. \end{cases}$$

The PMF and CDF are shown in Figure 5.12. The code from Section 2.4 for generating
M realizations of X is

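for i=1:M
   u=rand(1,1);            % one outcome of a uniform (0,1) random variable
   if u<=0.2
      x(i,1)=1;            % occurs with probability 0.2
   elseif u>0.2 & u<=0.8
      x(i,1)=2;            % occurs with probability 0.6
   elseif u>0.8
      x(i,1)=3;            % occurs with probability 0.2
   end
end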

Recall  that  U  is  a  random  variable  whose  values  are  equally  likely  to  fall  within  the 
interval  (0, 1).  It  is  called  the  uniform  random  variable  and  is  described  further  in 
Chapter 10. Now to estimate the PMF $p_X[k] = P[X = k]$ for $k = 1, 2, 3$ we use the
relative frequency interpretation of probability to yield

$$\hat{p}_X[k] = \frac{\text{Number of outcomes equal to } k}{M} \qquad k = 1, 2, 3. \qquad (5.14)$$

Figure 5.12: The probability mass function and cumulative distribution function for
computer simulation example. (a) PMF, (b) CDF.


For $M = 100$ this is shown in Figure 5.13a. Also, the CDF is estimated for all $x$ via

$$\hat{F}_X(x) = \frac{\text{Number of outcomes} \le x}{M} \qquad (5.15)$$

or equivalently by

$$\hat{F}_X(x) = \sum_{\{k : k \le x\}} \hat{p}_X[k] \qquad (5.16)$$

and  is  shown  in  Figure  5.13b.  For  finite  sample  spaces  this  approach  to  simulate 
a  discrete  random  variable  is  adequate.  But  for  infinite  sample  spaces  such  as  for 
the  geometric  and  Poisson  random  variables  a  different  approach  is  needed.  See 
Problem  5.30  for  a  further  discussion. 
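The estimators (5.14) and (5.16) can be sketched in a few lines. The version below is written in Python rather than the MATLAB used in the text, and the function names, the seed, and the number of realizations are illustrative assumptions, not part of the original program.

```python
import random
from collections import Counter

def simulate(M, rng):
    """Generate M realizations of X with PMF 0.2, 0.6, 0.2 on {1, 2, 3}."""
    x = []
    for _ in range(M):
        u = rng.random()          # outcome of a uniform random variable on [0, 1)
        if u <= 0.2:
            x.append(1)
        elif u <= 0.8:
            x.append(2)
        else:
            x.append(3)
    return x

def estimate_pmf(x, values=(1, 2, 3)):
    """Relative-frequency estimate of the PMF, as in (5.14)."""
    counts = Counter(x)
    return {k: counts[k] / len(x) for k in values}

def estimate_cdf(pmf_hat, point):
    """Estimate of the CDF at `point` by summing the estimated PMF, as in (5.16)."""
    return sum(p for k, p in pmf_hat.items() if k <= point)

rng = random.Random(0)            # fixed seed (an arbitrary choice) for repeatability
x = simulate(100_000, rng)
pmf_hat = estimate_pmf(x)
print(pmf_hat)
print(estimate_cdf(pmf_hat, 2.5))
```

With this many realizations the estimated PMF should lie close to 0.2, 0.6, 0.2, and the estimated CDF at any point between 2 and 3 close to 0.8.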

Before  concluding  our  discussion  we  wish  to  point  out  a  useful  property  of  CDFs 
that  simplifies  the  computer  generation  of  random  variable  outcomes.  Note  from 
Figure 5.12b with $u = F_X(x)$ that we can define an inverse CDF as $x = F_X^{-1}(u)$
where

$$x = F_X^{-1}(u) = \begin{cases} 1 & \text{if } 0 < u \le 0.2 \\ 2 & \text{if } 0.2 < u \le 0.8 \\ 3 & \text{if } 0.8 < u < 1 \end{cases}$$

or we choose the value of $x$ as shown in Figure 5.14. But if $u$ is the outcome
of a uniform random variable $U$ on (0, 1), then this procedure is identical to that
implemented in the previous MATLAB program used to generate realizations of
$X$. A more general program is given in Appendix 6B as PMFdata.m. That the two
procedures agree is not merely a coincidence but can be shown to follow from the
definition of the CDF. Although little more than a curiosity now, this property will
become important when we simulate continuous random variables in Chapter 10.

Figure 5.13: The estimated probability mass function and corresponding estimated
cumulative distribution function.

Figure 5.14: Relationship of inverse CDF to generation of discrete random variable.
Value of $u$ is mapped into value of $x$.
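The inverse CDF mapping can be written for an arbitrary finite PMF. The following Python sketch is not the PMFdata.m program of Appendix 6B; the function name `inverse_cdf` and the seed are assumptions made for illustration.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def inverse_cdf(u, values, probs):
    """Return the smallest value x whose CDF F_X(x) is at least u."""
    cdf = list(accumulate(probs))             # running sums of the PMF
    idx = bisect_left(cdf, u)                 # first index with cdf[idx] >= u
    return values[min(idx, len(values) - 1)]  # guard against round-off in cdf[-1]

values, probs = [1, 2, 3], [0.2, 0.6, 0.2]
print([inverse_cdf(u, values, probs) for u in (0.1, 0.5, 0.9)])

rng = random.Random(1)                        # arbitrary seed for repeatability
x = [inverse_cdf(rng.random(), values, probs) for _ in range(50_000)]
print(x.count(2) / len(x))                    # relative frequency of the outcome 2
```

The three test values of $u$ fall in the intervals (0, 0.2], (0.2, 0.8], and (0.8, 1), so they map to 1, 2, and 3 respectively.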


5.10  Real-World  Example  —  Servicing  Customers 

A  standard  problem  in  many  disciplines  is  the  allocation  of  resources  to  service 
customers.  It  occurs  in  determining  the  number  of  cashiers  needed  at  a  store, 
the  computer  capacity  needed  to  service  download  requests,  and  the  amount  of 
equipment necessary to service phone customers, as examples. In order to service
these customers in a timely manner, it is necessary to know the distribution of
arrival times of their requests. Since this will vary depending on many factors
such as time of day, popularity of a file request, etc., the best we can hope for is
a determination of the probabilities of these arrivals. As we will see shortly, the
Poisson PMF is particularly suitable as a model. We now focus on the
problem of determining the number of cashiers needed in a supermarket.

A  supermarket  has  one  express  lane  open  from  5  to  6  PM  on  weekdays  (Monday 
through  Friday).  This  time  of  the  day  is  usually  the  busiest  since  people  tend  to 
stop  on  their  way  home  from  work  to  buy  groceries.  The  number  of  items  allowed  in 
the  express  lane  is  limited  to  10  so  that  the  average  time  to  process  an  order  is  fairly 
constant  at  about  1  minute.  The  manager  of  the  supermarket  notices  that  there  is 
frequently  a  long  line  of  people  waiting  and  hears  customers  grumbling  about  the 
wait.  To  improve  the  situation  he  decides  to  open  additional  express  lanes  during 
this  time  period.  If  he  does,  however,  he  will  have  to  “pull”  workers  from  other  jobs 
around  the  store  to  serve  as  cashiers.  Hence,  he  is  reluctant  to  open  more  lanes  than 
necessary.  He  hires  Professor  Poisson  to  study  the  problem  and  tell  him  how  many 
lanes  should  be  opened.  The  manager  tells  Professor  Poisson  that  there  should  be 
no  more  than  one  person  waiting  in  line  95%  of  the  time.  Since  the  processing  time 
is  1  minute,  there  can  be  at  most  two  arrivals  in  each  time  slot  of  1  minute  length. 
He  reasons  that  one  will  be  immediately  serviced  and  the  other  will  only  have  to 
wait  a  maximum  of  1  minute.  After  a  week  of  careful  study,  Professor  Poisson  tells 
the  manager  to  open  two  lanes  from  5  to  6  PM.  Here  is  his  reasoning. 

Figure 5.15: Arrival times at one express lane on Monday (a ‘+’ indicates an arrival).

First Professor Poisson observes the arrivals of customers in the express lane
on a Monday from 5 to 6 PM. The observed arrivals are shown in Figure 5.15,
where the arrival times are measured in seconds. On Monday there are a total of
80 arrivals. He repeats his experiment on the following 4 days (Tuesday through
Friday)  and  notes  total  arrivals  of  68,  70,  59,  and  66  customers,  respectively.  On 
the  average  there  are  68.6  arrivals,  which  he  rounds  up  to  70.  Thus,  the  arrival  rate 
is  1.167  customers  per  minute.  He  then  likens  the  arrival  process  to  one  in  which 
the  5  to  6  PM  time  interval  is  broken  up  into  3600  time  slots  of  1  second  each.  He 
reasons  that  there  is  at  most  1  arrival  in  a  given  time  slot  and  there  may  be  no 
arrivals  in  that  time  slot.  (This  of  course  would  not  be  valid  if  for  instance,  two 
friends  did  their  shopping  together  and  arrived  at  the  same  time.)  Hence,  Professor 
Poisson  reasons  that  a  good  arrival  model  is  a  sequence  of  independent  Bernoulli 
trials,  where  0  indicates  no  arrival  and  1  indicates  an  arrival  in  each  1-second  time 
slot.  The  probability  p  of  a  1  is  estimated  from  his  observed  data  as  the  number 
of  arrivals  from  5  to  6  PM  divided  by  the  total  number  of  time  slots  in  seconds. 
This  yields  p  =  70/3600  =  0.0194  for  each  1-second  time  slot.  Instead  of  using  the 
binomial  PMF  to  describe  the  number  of  arrivals  in  each  1-minute  time  slot  (for 
which  p  =  0.0194  and  M  =  60),  he  decides  to  approximate  it  using  his  favorite 
PMF,  the  Poisson  model.  Therefore,  the  probability  of  k  arrivals  (or  successes)  in 
a time interval of 60 seconds would be

$$p_{X_1}[k] = \exp(-\lambda_1) \frac{\lambda_1^k}{k!} \qquad k = 0, 1, \ldots \qquad (5.17)$$

where the subscripts on $X$ and $\lambda$ are meant to remind us that we will initially
consider the arrivals at one express lane. The value of $\lambda_1$ to be used is $\lambda_1 = Mp$,
which is estimated as $\hat{\lambda}_1 = M\hat{p} = 60(70/3600) = 7/6$. This represents the expected
number  of  customers  arriving  in  the  1-minute  interval.  According  to  the  manager’s 
requirements,  within  this  time  interval  there  should  be  at  most  2  customers  arriving 
95%  of  the  time.  Hence,  we  require  that 


$$P[X_1 \le 2] = \sum_{k=0}^{2} p_{X_1}[k] \ge 0.95.$$

But from (5.17) this becomes

$$P[X_1 \le 2] = \exp(-\lambda_1)\left(1 + \lambda_1 + \frac{1}{2}\lambda_1^2\right) = 0.88$$


using $\lambda_1 = 7/6$. Hence, the probability of 2 or fewer customers arriving at the
express lane falls short of the required 0.95. If a second express lane is opened, then
the average number of arrivals at each lane during the hour will be halved to 35, or
7/12 per 1-minute interval. Therefore, the Poisson PMF for the number of arrivals at
each lane will be characterized by $\lambda_2 = 7/12$. Now, however, there are two lanes and
two sets of arrivals. Since the arrivals are modeled as independent Bernoulli trials,
we can assert that

$$\begin{aligned} P[\text{2 or fewer arrivals at both lanes}] &= P[\text{2 or fewer arrivals at lane 1}] \cdot P[\text{2 or fewer arrivals at lane 2}] \\ &= P[\text{2 or fewer arrivals at lane 1}]^2 \\ &= P[X_1 \le 2]^2 \end{aligned}$$

so that

$$P[\text{2 or fewer arrivals at both lanes}] = \left(\sum_{k=0}^{2} p_{X_1}[k]\right)^2 = \left[\exp(-\lambda_2)\left(1 + \lambda_2 + \frac{1}{2}\lambda_2^2\right)\right]^2 = 0.957$$


which meets the requirement. An example is shown for one of the two express lanes
with an average number of customer arrivals per minute of 7/12 in Figures 5.16 and
5.17, with the latter an expanded version of the former. The dashed vertical lines
in Figure 5.17 indicate 1-minute intervals. There are no 1-minute intervals with
more than 2 arrivals, as we expect.

Figure 5.16: Arrival times at one of the two express lanes (a ‘+’ indicates an arrival).

Figure 5.17: Expanded version of Figure 5.16 (a ‘+’ indicates an arrival). Time
slots of 60 seconds are shown by dashed lines.
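The two probabilities used in Professor Poisson's argument are easy to check numerically. The following is a minimal Python sketch (the text itself uses MATLAB); the helper name `poisson_cdf` is an assumption introduced here for illustration.

```python
from math import exp, factorial

def poisson_cdf(n, lam):
    """P[X <= n] for X ~ Pois(lam), obtained by summing the PMF of (5.17)."""
    return sum(exp(-lam) * lam**k / factorial(k) for k in range(n + 1))

lam_one_lane = 60 * (70 / 3600)              # 7/6 arrivals per minute, one lane
p_one = poisson_cdf(2, lam_one_lane)         # about 0.887 (quoted as 0.88 in the text)

lam_two_lanes = lam_one_lane / 2             # 7/12 arrivals per minute at each lane
p_two = poisson_cdf(2, lam_two_lanes) ** 2   # both lanes, assumed independent

print(round(p_one, 3), round(p_two, 3))
```

Only the two-lane value exceeds the manager's 0.95 requirement, in agreement with the recommendation to open two lanes.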
Problems

5.1  (w)  Draw  a  picture  depicting  a  mapping  of  the  outcome  of  a  die  toss,  i.e.,  the 

pattern  of  dots  that  appear,  to  the  numbers  1, 2, 3, 4, 5, 6. 

5.2  (w)  Repeat  Problem  5.1  for  a  mapping  of  the  sides  that  display  1,  2,  or  3  dots 

to  the  number  0  and  the  remaining  sides  to  the  number  1. 

5.3 (w) Consider a random experiment for which $S = \{s_i : s_i = i,\ i = 1, 2, \ldots, 10\}$
and the outcomes are equally likely. If a random variable is defined as $X(s_i) =$
, find $S_X$ and the PMF.

5.4 (^) (w) Consider a random experiment for which $S = \{s_i : s_i = -3, -2, -1, 0, 1, 2, 3\}$
and the outcomes are equally likely. If a random variable is defined as
$X(s_i) = s_i^2$, find $S_X$ and the PMF.

5.5 (w) A man is late for his job by $s_i = i$ minutes, where $i = 1, 2, \ldots$. If $P[s_i] =
(1/2)^i$ and he is fined \$0.50 per minute, find the PMF of his fine. Next find
the probability that he will be fined more than \$10.

5.6 (o) (w) If $p_X[k] = \alpha p^k$ for $k = 2, 3, \ldots$ is to be a valid PMF, what are the
possible values for $\alpha$ and $p$?

5.7 (t) The maximum value of the binomial PMF occurs for the unique value $k =
\lfloor (M+1)p \rfloor$, where $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$, if
$(M+1)p$ is not an integer. If, however, $(M+1)p$ is an integer, then the PMF
will have the same maximum value at $k = (M+1)p$ and $k = (M+1)p - 1$.
For the latter case when $(M+1)p$ is an integer you are asked to prove this
result. To do so first show that

$$p_X[k]/p_X[k-1] = 1 + \frac{(M+1)p - k}{k(1-p)}.$$

5.8  (^)  (w)  At  a  party  a  large  barrel  is  filled  with  99  gag  gifts  and  1  diamond  ring, 

all  enclosed  in  identical  boxes.  Each  person  at  the  party  is  given  a  chance  to 
pick  a  box  from  the  barrel,  open  the  box  to  see  if  the  diamond  is  inside,  and  if 
not,  to  close  the  box  and  return  it  to  the  barrel.  What  is  the  probability  that 
at  least  19  persons  will  choose  gag  gifts  before  the  diamond  ring  is  selected? 

5.9  (f,c)  If  X  is  a  geometric  random  variable  with  p  =  0.25,  what  is  the  probability 

that  X  >  4?  Verify  your  result  by  performing  a  computer  simulation. 

5.10  (c)  Using  a  computer  simulation  to  generate  a  geom(0.25)  random  variable, 
determine  the  average  value  for  a  large  number  of  realizations.  Relate  this  to 
the  value  of  p  and  explain  the  results. 

5.11 (t) Prove that the maximum value of a Poisson PMF occurs at $k = \lfloor \lambda \rfloor$. Hint:
See Problem 5.7 for the approach.

5.12 (w,c) If $X \sim \text{Pois}(\lambda)$, plot $P[X > 2]$ versus $\lambda$ and explain your results.

5.13 (o) (c) Use a computer simulation to generate realizations of a Pois($\lambda$) random
variable with $\lambda = 5$ by approximating it with a bin(100, 0.05) random
variable. What is the average value of $X$?

5.14 (o) (w) If $X \sim \text{bin}(100, 0.01)$, determine $p_X[5]$. Next compare this to the
value obtained using a Poisson approximation.

5.15 (t) Prove the following limit:

$$\lim_{M \to \infty} \left(1 + \frac{x}{M}\right)^M = \exp(x).$$

To do so note that the same limit is obtained if $M$ is replaced by a continuous
variable, say $u$, and that one can consider $\ln g(u)$ since the logarithm is a
continuous function. Hint: Use L’Hospital’s rule.

5.16 (f,c) Compare the PMFs for Pois(1) and bin(100, 0.01) random variables.

5.17 (c) Generate realizations of a Pois(1) random variable by using a binomial
approximation.

5.18 (o) (c) Compare the theoretical value of $P[X = 3]$ for the Poisson random
variable to the estimated value obtained from the simulation of Problem 5.17.


5.19  (f)  If  X  ~  Ber(p),  find  the  PMF  for  Y  =  -X. 

5.20 (0) (f) If $X \sim \text{Pois}(\lambda)$, find the PMF for $Y = 2X$.

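5.21 (f) A discrete random variable $X$ has the PMF

$$p_X[x_i] = \begin{cases} \frac{1}{2} & x_1 = -1 \\ \frac{1}{4} & x_2 = -\frac{1}{2} \\ \frac{1}{8} & x_3 = 0 \\ \frac{1}{16} & x_4 = \frac{1}{2} \\ \frac{1}{16} & x_5 = 1. \end{cases}$$

If $Y = \sin \pi X$, find the PMF for $Y$.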

5.22 (t) In this problem we derive the Taylor expansion for the function $g(x) =
\exp(x)$. To do so note that the expansion about the point $x = 0$ is given by

$$g(x) = \sum_{n=0}^{\infty} \frac{g^{(n)}(0)}{n!} x^n$$

where $g^{(0)}(0) = g(0)$ and $g^{(n)}(0)$ is the $n$th derivative of $g(x)$ evaluated at
$x = 0$. Prove that it is given by

$$\exp(x) = \sum_{n=0}^{\infty} \frac{x^n}{n!}.$$


5.23 (f) Plot the CDF for

$$p_X[k] = \begin{cases} \frac{1}{4} & k = 1 \\ \frac{1}{2} & k = 2 \\ \frac{1}{4} & k = 3. \end{cases}$$


5.24  (w)  A  horizontal  bar  of  negligible  weight  is  loaded  with  three  weights  as  shown 
in  Figure  5.18.  Assuming  that  the  weights  are  concentrated  at  their  center 
locations,  plot  the  total  mass  of  the  bar  starting  at  the  left  end  (where  x  =  0 
meters) to any point on the bar. How does this relate to a PMF and a CDF?

Figure 5.18: Weightless bar supporting three weights. The horizontal axis is marked
from 0 to 6 meters.

5.25  (f )  Find  and  plot  the  CDF  of  Y  =  -X  if  X  ~  Ber(±). 

5.26  (^)  (w)  Find  the  PMF  if  X  is  a  discrete  random  variable  with  the  CDF 

{0  x  <  0 

^  0  <  x  <  5 
1  x  >  5. 


5.27  (w)  Is  the  following  a  valid  CDF?  If  not,  why  not,  and  how  could  you  modify 
it  to  become  a  valid  one? 


x  <  2 

2  <  x  <  3 

3  <  x  <  4 
x  >  4. 


5.28  (o)  (f)  If  X  has  the  CDF  shown  in  Figure  5.11b,  determine  P[ 2  <  X  <  4] 
from  the  CDF. 

5.29 (t) Prove that the function $g(x) = \exp(x)$ is a monotonically increasing function
by showing that $g(x_2) > g(x_1)$ if $x_2 > x_1$.

5.30  (c)  Estimate  the  PMF  for  a  geom(0.25)  random  variable  for  k  =  1, 2, . . . ,  20 
using  a  computer  simulation  and  compare  it  to  the  true  PMF.  Also,  estimate 
the  CDF  from  your  computer  simulation. 

5.31  (0)  (f,c)  The  arrival  rate  of  calls  at  a  mobile  switching  station  is  1  per  second. 
The probability of $k$ calls in a $T$ second interval is given by a Poisson PMF
with $\lambda = \text{arrival rate} \times T$. What is the probability that there will be more
than 100 calls placed in a 1-minute interval?


Chapter  6 


Expected  Values  for  Discrete 
Random  Variables 

6.1  Introduction 

The probability mass function (PMF) discussed in Chapter 5 is a complete description
of a discrete random variable. As we have seen, it allows us to determine
probabilities  of  any  event.  Once  the  probability  of  an  event  of  interest  is  determined, 
however,  the  question  of  its  interpretation  arises.  Consider,  for  example,  whether 
there  is  adequate  rainfall  in  Rhode  Island  to  sustain  a  farming  endeavor.  The  past 
history  of  yearly  summer  rainfall  was  shown  in  Figure  1.1  and  is  repeated  in  Figure 
6.1a  for  convenience.  Along  with  it,  the  estimated  PMF  of  this  yearly  data  is  shown 
in  Figure  6.1b  (see  Section  5.9  for  a  discussion  on  how  to  estimate  the  PMF).  For 
a particular crop we might need a rainfall of between 8 and 12 inches. This event
has probability 0.5278, obtained by $\sum_{k=8}^{12} \hat{p}_X[k]$ for the estimated PMF shown in
Figure 6.1b. Is this adequate or should the probability be higher? Answers to such
questions  are  at  best  problematic.  Rather  we  might  be  better  served  by  ascertaining 
the  average  rainfall  since  this  is  closer  to  the  requirement  of  an  adequate  amount 
of  rainfall.  In  the  case  of  Figure  6.1a  the  average  is  9.76  inches,  and  is  obtained  by 
summing  all  the  yearly  rainfalls  and  dividing  by  the  number  of  years.  Based  on  the 
given  data  it  is  a  simple  matter  to  estimate  the  average  value  of  a  random  variable 
(the  rainfall  in  this  case).  Some  computer  simulation  results  pertaining  to  averages 
have  already  been  presented  in  Example  2.3.  In  this  chapter  we  address  the  topic  of 
the  average  or  expected  value  of  a  discrete  random  variable  and  study  its  properties. 

6.2  Summary 

Figure 6.1: Annual summer rainfall in Rhode Island and its estimated probability
mass function. (a) Annual summer rainfall versus year, 1900 to 2000. (b) Estimated
PMF versus $k$ (inches).

The expected value of a random variable is the average value of the outcomes of
a large number of experimental trials. It is formally defined by (6.1). For discrete
random variables with integer values it is given by (6.2) and some examples of its
determination are given in Section 6.4. The expected value does not exist for all PMFs
as  illustrated  in  Section  6.4.  For  functions  of  a  random  variable  the  expected  value 
is  easily  computed  via  (6.5).  It  is  shown  to  be  a  linear  operation  in  Section  6.5. 
Another  interpretation  of  the  expected  value  is  as  the  best  predictor  of  the  outcome 
of  an  experiment  as  shown  in  Example  6.3.  The  variability  of  the  values  exhibited  by 
a  random  variable  is  quantified  by  the  variance.  It  is  defined  in  (6.6)  with  examples 
given  in  Section  6.6.  Some  properties  of  the  variance  are  summarized  in  Section 
6.6  as  Properties  1  and  2.  An  alternative  way  to  determine  means  and  variances  of 
a  discrete  random  variable  is  by  using  the  characteristic  function.  It  is  defined  by 
(6.10)  and  for  integer  valued  random  variables  it  is  evaluated  using  (6.12),  which  is 
a  Fourier  transform  of  the  PMF.  Having  determined  the  characteristic  function,  one 
can  easily  determine  the  mean  and  variance  by  using  (6.13).  Some  examples  of  this 
procedure  are  given  in  Section  6.7,  as  are  some  further  important  properties  of  the 
characteristic  function.  An  important  property  is  that  the  PMF  may  be  obtained 
from  the  characteristic  function  as  an  inverse  Fourier  transform  as  expressed  by 
(6.19).  In  Section  6.8  an  example  is  given  to  illustrate  how  to  estimate  the  mean 
and  variance  of  a  discrete  random  variable.  Finally,  Section  6.9  describes  the  use  of 
the  expected  value  to  reduce  the  average  code  length  needed  to  store  symbols  in  a 
digital  format.  This  is  called  data  compression. 


6.3  Determining  Averages  from  the  PMF 

We  now  discuss  how  the  average  of  a  discrete  random  variable  can  be  obtained  from 
the PMF. To motivate the subsequent definition we consider the following game of
chance. A barrel is filled with US dollar bills with denominations of $1, $5, $10,
and  $20.  The  proportion  of  each  denomination  bill  is  the  same.  A  person  playing 
the  game  gets  to  choose  a  bill  from  the  barrel,  but  must  do  so  while  blindfolded.  He 
pays  $10  to  play  the  game,  which  consists  of  a  single  draw  from  the  barrel.  After  he 
observes  the  denomination  of  the  bill,  the  bill  is  returned  to  the  barrel  and  he  wins 
that  amount  of  money.  Will  he  make  a  profit  by  playing  the  game  many  times? 
A typical sequence of outcomes for the game is shown in Figure 6.2. His average
winnings per play is found by adding up all his winnings and dividing by the number
of plays $N$. This is computed by

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

where $x_i$ is his winnings for play $i$. Alternatively, we can compute $\bar{x}$ using a slightly
different approach. From Figure 6.2 the number of times he wins $k$ dollars (where
$k = 1, 5, 10, 20$) is given by $N_k$, where

$$N_1 = 13, \quad N_5 = 13, \quad N_{10} = 10, \quad N_{20} = 14.$$

Figure 6.2: Dollar winnings for each play. The horizontal axis is the play number.

As a result, we can determine the average winnings per play by

$$\begin{aligned} \bar{x} &= \frac{1 \cdot N_1 + 5 \cdot N_5 + 10 \cdot N_{10} + 20 \cdot N_{20}}{N_1 + N_5 + N_{10} + N_{20}} \\ &= 1 \cdot \frac{N_1}{N} + 5 \cdot \frac{N_5}{N} + 10 \cdot \frac{N_{10}}{N} + 20 \cdot \frac{N_{20}}{N} \\ &= 1 \cdot \frac{13}{50} + 5 \cdot \frac{13}{50} + 10 \cdot \frac{10}{50} + 20 \cdot \frac{14}{50} \\ &= 9.16 \end{aligned}$$

since $N = N_1 + N_5 + N_{10} + N_{20} = 50$. If he were to play the game a large number
of times, then as $N \to \infty$ we would have $N_k/N \to p_X[k]$, where the latter is just
the PMF for choosing a bill with denomination $k$, and results from the relative
frequency interpretation of probability. Then, his average winnings per play would
be found as

$$\bar{x} \to 1 \cdot p_X[1] + 5 \cdot p_X[5] + 10 \cdot p_X[10] + 20 \cdot p_X[20] = 9$$


where  px[k]  =  1/4  for  k  =  1, 5, 10, 20  since  the  proportion  of  bill  denominations  in 
the  barrel  is  the  same  for  each  denomination.  It  is  now  clear  that  “on  the  average” 
he  will  lose  $1  per  play.  The  value  that  the  average  converges  to  is  called  the  expected 
value  of  X,  where  X  is  the  random  variable  that  describes  his  winnings  for  a  single 
play  and  takes  on  the  values  1,5,10,20.  The  expected  value  is  denoted  by  E[X]. 
For  this  example,  the  PMF  as  well  as  the  expected  value  is  shown  in  Figure  6.3. 
The  expected  value  is  also  called  the  expectation  of  X,  the  average  of  X,  and  the 
mean  of  X.  With  this  example  as  motivation  we  now  define  the  expected  value  of  a 
discrete  random  variable  X  as 


$$E[X] = \sum_i x_i p_X[x_i] \qquad (6.1)$$

where the sum is over all values of $x_i$ for which $p_X[x_i]$ is nonzero. It is determined
from the PMF and as we have seen coincides with our notion of the outcome of an
experiment in the “long run” or “on the average.” The expected value may also be
interpreted as the best prediction of the outcome of a random experiment for a single
trial (to be described in Example 6.3). Finally, the expected value is analogous to
the center of mass of a system of linearly arranged masses as illustrated in Problem
6.1.
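The arithmetic of the dollar bill game can be reproduced exactly with rational numbers. The Python sketch below takes the counts for the 50 plays from the text; the variable names are illustrative assumptions.

```python
from fractions import Fraction

# Counts N_k for the N = 50 plays described in the text (keys are denominations k).
counts = {1: 13, 5: 13, 10: 10, 20: 14}
N = sum(counts.values())

# Sample average: each denomination weighted by its relative frequency N_k / N.
sample_avg = sum(k * Fraction(n, N) for k, n in counts.items())

# Expected value from the definition, with p_X[k] = 1/4 for each denomination.
pmf = {k: Fraction(1, 4) for k in counts}
expected = sum(k * p for k, p in pmf.items())

# Average net result per play after paying 10 dollars to play.
print(float(sample_avg), float(expected), float(expected) - 10)
```

The sample average of the 50 plays is 9.16 dollars, the expected value is 9 dollars, and the last number printed is the average net result of minus one dollar per play.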
Figure 6.3: PMF and expected value of dollar bill denomination chosen.

6.4  Expected  Values  of  Some  Important  Random 
Variables 

The  definition  of  the  expected  value  was  given  by  (6.1).  When  the  random  variable 
takes  on  only  integer  values,  we  can  rewrite  it  as 

$$E[X] = \sum_{k=-\infty}^{\infty} k\, p_X[k]. \qquad (6.2)$$

We  next  determine  the  expected  values  for  some  important  discrete  random  variables 
(see  Chapter  5  for  a  definition  of  the  PMFs). 

6.4.1  Bernoulli 

If  X  ~  Ber(p),  then  the  expected  value  is 

$$\begin{aligned} E[X] &= \sum_{k=0}^{1} k\, p_X[k] \\ &= 0 \cdot (1-p) + 1 \cdot p \\ &= p. \end{aligned}$$

Note that $E[X]$ need not be a value that the random variable takes on. In this case,
it is between $X = 0$ and $X = 1$.

6.4.2 Binomial

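If $X \sim \text{bin}(M, p)$, then the expected value is

$$E[X] = \sum_{k=0}^{M} k\, p_X[k] = \sum_{k=0}^{M} k \binom{M}{k} p^k (1-p)^{M-k}.$$

To evaluate this in closed form we will need to find an expression for the sum.
Continuing, we have that

$$\begin{aligned} E[X] &= \sum_{k=0}^{M} k \frac{M!}{(M-k)!\,k!} p^k (1-p)^{M-k} \\ &= Mp \sum_{k=1}^{M} \frac{(M-1)!}{(M-k)!\,(k-1)!} p^{k-1} (1-p)^{M-1-(k-1)} \end{aligned}$$

and letting $M' = M - 1$, $k' = k - 1$, this becomes

$$\begin{aligned} E[X] &= Mp \sum_{k'=0}^{M'} \frac{M'!}{(M'-k')!\,k'!} p^{k'} (1-p)^{M'-k'} \\ &= Mp \sum_{k'=0}^{M'} \binom{M'}{k'} p^{k'} (1-p)^{M'-k'} \\ &= Mp \end{aligned}$$

since the summand is just the PMF of a bin($M'$, $p$) random variable. Therefore, we
have that $E[X] = Mp$ for a binomial random variable. This derivation is typical in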
that  we  attempt  to  manipulate  the  sum  into  one  whose  summands  are  the  values  of 
a  PMF  and  so  the  sum  must  evaluate  to  one.  Intuitively,  we  expect  that  if  p  is  the 
probability  of  success  for  a  Bernoulli  trial,  then  the  expected  number  of  successes 
for  M  independent  Bernoulli  trials  (which  is  binomially  distributed)  is  Mp. 
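The closed form can be checked against a direct evaluation of the defining sum. This is a small Python sketch; the helper name `binomial_mean` and the particular pairs of $M$ and $p$ are assumptions chosen for illustration.

```python
from math import comb

def binomial_mean(M, p):
    """Evaluate E[X] = sum_k k * C(M, k) p^k (1-p)^(M-k) term by term."""
    return sum(k * comb(M, k) * p**k * (1 - p)**(M - k) for k in range(M + 1))

# Each line prints M, p, the summed value, and the closed form M*p.
for M, p in [(10, 0.3), (60, 70 / 3600), (100, 0.05)]:
    print(M, p, binomial_mean(M, p), M * p)
```

In each case the two printed values agree up to floating-point round-off.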


6.4.3  Geometric 

If $X \sim \text{geom}(p)$, then the expected value is

$$E[X] = \sum_{k=1}^{\infty} k (1-p)^{k-1} p.$$

To evaluate this in closed form, we need to modify the summand to be a PMF,
which in this case will produce a geometric series. To do so we use differentiation
by first letting $q = 1 - p$ to produce

$$E[X] = p \sum_{k=1}^{\infty} \frac{d}{dq} q^k = p \frac{d}{dq} \sum_{k=1}^{\infty} q^k.$$

But since $0 < q < 1$ we have upon using the formula for the sum of a geometric
series or $\sum_{k=1}^{\infty} q^k = q/(1-q)$ that

$$\begin{aligned} E[X] &= p \frac{d}{dq} \left( \frac{q}{1-q} \right) \\ &= p \frac{(1-q) - q(-1)}{(1-q)^2} \\ &= \frac{1}{p}. \end{aligned}$$

The expected number of Bernoulli trials until the first success (which is geometrically
distributed) is $E[X] = 1/p$. For example, if $p = 1/10$, then on the average it takes
10 trials for a success, an intuitively pleasing result.
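The convergence of the defining sum to $1/p$ can be observed by truncating it at increasing values of $k$. The Python sketch below uses the hypothetical helper name `geometric_mean_partial`; the truncation points are arbitrary.

```python
def geometric_mean_partial(p, K):
    """Partial sum of E[X] = sum_{k>=1} k (1-p)^(k-1) p, truncated at k = K."""
    q = 1 - p
    return sum(k * q**(k - 1) * p for k in range(1, K + 1))

p = 0.1
for K in (10, 50, 200, 1000):
    print(K, geometric_mean_partial(p, K))   # approaches 1/p = 10 from below
```

Because every term is positive, the partial sums increase with $K$ and approach 10 from below.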


6.4.4  Poisson 

If $X \sim \text{Pois}(\lambda)$, then it can be shown that $E[X] = \lambda$. The reader is asked to
verify this in Problem 6.5. Note that this result is consistent with the Poisson
approximation to the binomial PMF since the approximation constrains $Mp$ (the
expected value of the binomial random variable) to be $\lambda$ (the expected value of the
Poisson random variable).


Not  all  PMFs  have  expected  values. 


Discrete  random  variables  with  a  finite  number  of  values  always  have  expected 
values.  In  the  case  of  a  countably  infinite  number  of  values,  a  discrete  random 
variable may not have an expected value. As an example of this, consider the PMF

$$p_X[k] = \frac{6/\pi^2}{k^2} \qquad k = 1, 2, \ldots. \qquad (6.3)$$

This is a valid PMF since it can be shown to sum to one (recall that
$\sum_{k=1}^{\infty} 1/k^2 = \pi^2/6$). Attempting to find the expected value produces

$$E[X] = \sum_{k=1}^{\infty} k\, p_X[k] = \frac{6}{\pi^2} \sum_{k=1}^{\infty} \frac{1}{k} \to \infty$$

since $\sum_{k=1}^{\infty} 1/k$ is the harmonic series, which is known not to be summable (meaning that
the partial sums do not converge). Hence, the random variable described by the
PMF of (6.3) does not have a finite expected value. It is even possible for a sum
$\sum_{k=-\infty}^{\infty} k\, p_X[k]$ that is composed of positive and negative terms to produce different
results depending upon the order in which the terms are added together. In this
case the value of the sum is said to be ambiguous. These difficulties can be avoided,
however, if we require the sum to be absolutely summable, that is, if the sum of the
absolute values of the terms is finite [Gaughan 1975]. Hence we will say that the
expected value exists if

$$E[|X|] = \sum_{k=-\infty}^{\infty} |k|\, p_X[k] < \infty.$$


In  Problem  6.6  a  further  discussion  of  this  point  is  given. 
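The failure of the expected value to exist can be seen numerically for a PMF proportional to $1/k^2$ on $k = 1, 2, \ldots$. The Python sketch below assumes the normalizing constant $6/\pi^2$, which is the value that makes such a PMF sum to one; the function name and truncation points are illustrative.

```python
from math import pi

c = 6 / pi**2   # normalizing constant, since the sum of 1/k^2 over k >= 1 is pi^2/6

def partial_sums(K):
    """Return (sum of p_X[k], sum of k * p_X[k]) over k = 1, ..., K."""
    total_prob = sum(c / k**2 for k in range(1, K + 1))
    partial_mean = sum(c / k for k in range(1, K + 1))   # k * (c / k^2) = c / k
    return total_prob, partial_mean

for K in (10, 1_000, 100_000):
    print(K, partial_sums(K))
```

The first entry of each pair approaches one, while the second keeps growing, roughly in proportion to $\ln K$, and never settles to a finite value.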


Lastly,  note  the  following  properties  of  the  expected  value. 

1.  It  is  located  at  the  “center”  of  the  PMF  if  the  PMF  is  symmetric  about  some 

point  (see  Problem  6.7). 

2.  It  does  not  generally  indicate  the  most  probable  value  of  the  random  variable 

(see  Problem  6.8). 

3.  More  than  one  PMF  may  have  the  same  expected  value  (see  Problem  6.9). 


6.5  Expected  Value  for  a  Function  of  a  Random 
Variable 

The expected value may easily be found for a function of a random variable $X$ if the
PMF $p_X[x_i]$ is known. If the function of interest is $Y = g(X)$, then by the definition
of expected value

$$E[Y] = \sum_i y_i p_Y[y_i]. \qquad (6.4)$$

But as shown in Appendix 6A we can avoid having to find the PMF for $Y$ by using
the much more convenient form

$$E[g(X)] = \sum_i g(x_i) p_X[x_i]. \qquad (6.5)$$

Otherwise, we would be forced to determine $p_Y[y_i]$ from $p_X[x_i]$ and $g(X)$ using (5.9).
This result proves to be very useful, especially when the function is a complicated
one such as $g(x) = \sin[(\pi/2)x]$ (see Problem 6.10). Some examples follow.

Example  6.1  —  A  linear  function 

If $g(X) = aX + b$, where $a$ and $b$ are constants, then

$$\begin{aligned} E[g(X)] &= E[aX + b] \\ &= \sum_i (a x_i + b) p_X[x_i] \quad \text{(from (6.5))} \\ &= a \sum_i x_i p_X[x_i] + b \sum_i p_X[x_i] \\ &= aE[X] + b \quad \text{(definition of $E[X]$ and PMF values sum to one)}. \end{aligned}$$

In particular, if we set $a = 1$, then $E[X + b] = E[X] + b$. This allows us to set the
expected value of a random variable to any desired value by adding the appropriate
constant to $X$. Finally, a simple extension of this example produces

$$E[a_1 g_1(X) + a_2 g_2(X)] = a_1 E[g_1(X)] + a_2 E[g_2(X)]$$

for any two constants $a_1$ and $a_2$ and any two functions $g_1$ and $g_2$ (see Problem 6.11).
It is said that the expectation operator $E$ is linear.

◊


Example  6.2  —  A  nonlinear  function 

Assume  that  X  has  a  PMF  given  by 

Px[k]  =  \  k  =  0,1, 2, 3, 4 
o 

and  determine  E[Y]  for  Y  =  g(X)  =  X2.  Then,  using  (6.5)  produces 


e[x 2]  = 

k= 0 

- 

k= 0 

=  6. 


It is not true that $E[g(X)] = g(E[X])$.

From the previous example with $g(X) = X^2$, we had that $E[g(X)] = E[X^2] = 6$ but $g(E[X]) = (E[X])^2 = 2^2 = 4 \neq E[g(X)]$. It is said that the expectation operator does not commute (or we cannot just take $E[g(X)]$ and interchange the $E$ and $g$) for nonlinear functions. This manipulation is valid, however, for linear (actually affine) functions as Example 6.1 demonstrates. Henceforth, we will use the notation $E^2[X]$ to replace the more cumbersome $(E[X])^2$.
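The numbers quoted from Example 6.2 can be reproduced in a few lines. This is a small Python sketch; the helper name `expectation` and the affine function 3x + 7 are hypothetical choices made only for illustration.

```python
from fractions import Fraction

# PMF of Example 6.2: p_X[k] = 1/5 for k = 0, 1, 2, 3, 4.
pmf_x = {k: Fraction(1, 5) for k in range(5)}


def expectation(pmf, g=lambda x: x):
    """E[g(X)] computed from (6.5)."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expectation(pmf_x)                              # E[X]
second_moment = expectation(pmf_x, lambda x: x ** 2)   # E[X^2]
affine = expectation(pmf_x, lambda x: 3 * x + 7)       # E[3X + 7]

print(second_moment, mean ** 2)   # E[X^2] versus (E[X])^2
print(affine, 3 * mean + 7)       # E and g interchange for an affine g
```

For the nonlinear function the two printed values differ, while for the affine function they coincide.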


Example  6.3  —  Predicting  the  outcome  of  an  experiment 

It  is  always  of  great  interest  to  be  able  to  predict  the  outcome  of  an  experiment 
before  it  has  occurred.  For  example,  if  the  experiment  were  the  summer  rainfall  in 
Rhode  Island  in  the  coming  year,  then  a  farmer  would  like  to  have  this  information 
before  he  decides  upon  which  crops  to  plant.  One  way  to  do  this  is  to  check  the 
Farmer’s  almanac,  but  its  accuracy  may  be  in  dispute!  Another  approach  would  be 
to  guess  this  number  based  on  the  PMF  (statisticians,  however,  use  the  more  formal 
term  “predict”  or  “estimate”  which  sounds  better) .  Denoting  the  prediction  by  the 
number $b$, we would like to choose a number so that on the average it is close to the true outcome of the random variable $X$. To measure the error we could use $x - b$, where $x$ is the outcome, and to account for positive and negative errors equally we could use $(x - b)^2$. This squared error may at times be small and at other times large, depending on the outcome of $X$. What we want is the average value of the squared error. This is measured by $E[(X - b)^2]$, and is termed the mean square error (MSE). We denote it by $\mathrm{mse}(b)$ since it will depend on our choice of $b$. A
reasonable  method  for  choosing  b  is  to  choose  the  value  that  minimizes  the  MSE. 
We  now  proceed  to  find  that  value  of  b. 


$$\begin{aligned}
\mathrm{mse}(b) &= E[(X - b)^2] \\
&= E[X^2 - 2bX + b^2] \\
&= E[X^2] - 2bE[X] + E[b^2] && \text{(linearity of $E[\cdot]$)} \\
&= E[X^2] - 2bE[X] + b^2 && \text{(expected value of constant is the constant).}
\end{aligned}$$


To  find  the  value  of  b  that  minimizes  the  MSE  we  need  only  differentiate  the  MSE, 
set  the  derivative  equal  to  zero,  and  solve  for  b.  This  is  because  the  MSE  is  a 
quadratic  function  of  b  whose  minimum  is  located  at  the  stationary  point.  Thus, 
we  have 


$$\frac{d\,\mathrm{mse}(b)}{db} = -2E[X] + 2b = 0$$

which produces the minimizing or optimal value of $b$ given by $b_{\mathrm{opt}} = E[X]$. Hence, the best predictor of the outcome of an experiment is the expected value or mean of the random variable. For example, the best predictor of the outcome of a die toss would be 3.5. This result provides another interpretation of the expected value. The expected value of a random variable is the best predictor of the outcome of the experiment, where “best” is to be interpreted as the value that minimizes the MSE.

❖ 
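The die-toss claim can be confirmed by brute force. The sketch below is a hedged illustration in Python: it evaluates mse(b) on a grid of candidate predictions (the grid spacing of 0.1 and all names are assumptions of this example) and picks the minimizer instead of differentiating.

```python
from fractions import Fraction


def mse(pmf, b):
    """Mean square error E[(X - b)^2] of the constant predictor b."""
    return sum((x - b) ** 2 * p for x, p in pmf.items())


# Fair die: p_X[k] = 1/6 for k = 1, ..., 6.
die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Candidate predictions b = 1.0, 1.1, ..., 6.0 (exact rational arithmetic).
candidates = [Fraction(n, 10) for n in range(10, 61)]
best_b = min(candidates, key=lambda b: mse(die_pmf, b))

print(best_b, mse(die_pmf, best_b))
```

The grid minimizer coincides with E[X] = 7/2, and the minimum value of the MSE is the variance of the die outcome, 35/12.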
6.6 Variance and Moments of a Random Variable


Another function of a random variable that yields important information about its behavior is that given by $g(X) = (X - E[X])^2$. Whereas $E[X]$ measures the mean of a random variable, $E[(X - E[X])^2]$ measures the average squared deviation from the mean. For example, a uniform discrete random variable whose PMF is

$$p_X[k] = \frac{1}{2M + 1} \qquad k = -M, -M + 1, \ldots, M$$

is easily shown to have a mean of zero for any $M$. However, as seen in Figure 6.4 the variability of the outcomes of the random variable becomes larger as $M$ increases. This is because the PMF for $M = 10$ can have values exceeding those for $M = 2$. The variability is measured by the variance which is defined as

$$\mathrm{var}(X) = E[(X - E[X])^2].$$

Note that the variance is always greater than or equal to zero. It is determined from the PMF using (6.5) with $g(X) = (X - E[X])^2$ to yield

$$\mathrm{var}(X) = \sum_i (x_i - E[X])^2 p_X[x_i]. \tag{6.7}$$

For  the  current  example,  E[X]  =  0  due  to  the  symmetry  of  the  PMF  about  k  =  0 
so  that 


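$$\begin{aligned}
\mathrm{var}(X) &= \sum_i x_i^2 p_X[x_i] \\
&= \sum_{k=-M}^{M} k^2 \frac{1}{2M + 1} \\
&= \frac{2}{2M + 1} \sum_{k=1}^{M} k^2.
\end{aligned}$$

But it can be shown that

$$\sum_{k=1}^{M} k^2 = \frac{M(M + 1)(2M + 1)}{6}$$

which yields

$$\begin{aligned}
\mathrm{var}(X) &= \frac{2}{2M + 1} \, \frac{M(M + 1)(2M + 1)}{6} \\
&= \frac{M(M + 1)}{3}.
\end{aligned}$$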

Clearly,  the  variance  increases  with  M,  or  equivalently  with  the  width  of  the  PMF,  as 
is  also  evident  from  Figure  6.4.  We  next  give  another  example  of  the  determination 
of  the  variance  and  then  summarize  the  results  for  several  important  PMFs. 
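The closed form M(M + 1)/3 is easy to verify directly from the definition (6.7). The following Python sketch uses exact rational arithmetic; the function names are hypothetical and chosen only for this check.

```python
from fractions import Fraction


def uniform_pmf(M):
    """PMF p_X[k] = 1/(2M + 1) for k = -M, ..., M."""
    return {k: Fraction(1, 2 * M + 1) for k in range(-M, M + 1)}


def variance(pmf):
    """var(X) from (6.7): sum of (x_i - E[X])^2 * p_X[x_i]."""
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())


for M in (2, 10):
    print(M, variance(uniform_pmf(M)), Fraction(M * (M + 1), 3))
```

For M = 2 the variance is 2 and for M = 10 it is 110/3, which shows how quickly the spread grows with the width of the PMF.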
Figure 6.4: Illustration of effect of width of PMF on variability of outcomes. Panels (c) and (d) show typical outcomes versus trial number for M = 2 and M = 10.

Example  6.4  —  Variance  of  Bernoulli  random  variable 

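If $X \sim \mathrm{Ber}(p)$, then since $E[X] = p$, we have

$$\begin{aligned}
\mathrm{var}(X) &= \sum_i (x_i - E[X])^2 p_X[x_i] \\
&= \sum_{k=0}^{1} (k - p)^2 p_X[k] \\
&= (0 - p)^2 (1 - p) + (1 - p)^2 p \\
&= p(1 - p).
\end{aligned}$$

❖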
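| Name | Values | PMF | $E[X]$ | $\mathrm{var}(X)$ | $\phi_X(\omega)$ |
|---|---|---|---|---|---|
| Uniform | $k = -M, \ldots, M$ | $\frac{1}{2M+1}$ | $0$ | $\frac{M(M+1)}{3}$ | $\frac{\sin[(2M+1)\omega/2]}{(2M+1)\sin[\omega/2]}$ |
| Bernoulli | $k = 0, 1$ | $p^k (1-p)^{1-k}$ | $p$ | $p(1-p)$ | $p\exp(j\omega) + (1-p)$ |
| Binomial | $k = 0, 1, \ldots, M$ | $\binom{M}{k} p^k (1-p)^{M-k}$ | $Mp$ | $Mp(1-p)$ | $[p\exp(j\omega) + (1-p)]^M$ |
| Geometric | $k = 1, 2, \ldots$ | $(1-p)^{k-1} p$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ | $\frac{p}{\exp(-j\omega) - (1-p)}$ |
| Poisson | $k = 0, 1, \ldots$ | $\exp(-\lambda)\frac{\lambda^k}{k!}$ | $\lambda$ | $\lambda$ | $\exp[\lambda(\exp(j\omega) - 1)]$ |

Table 6.1: Properties of discrete random variables.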


It is interesting to note that the variance is minimized and equals zero if $p = 0$ or $p = 1$. Also, it is maximized for $p = 1/2$. Can you explain this? Important PMFs
with  their  means,  variances,  and  characteristic  functions  (to  be  discussed  in  Section 
6.7)  are  listed  in  Table  6.1.  The  reader  is  asked  to  derive  some  of  these  entries  in 
the  Problems. 

An  alternative  useful  expression  for  the  variance  can  be  developed  based  on  the 
properties  of  the  expectation  operator.  We  have  that 

$$\begin{aligned}
\mathrm{var}(X) &= E[(X - E[X])^2] \\
&= E[X^2 - 2XE[X] + E^2[X]] \\
&= E[X^2] - 2E[X]E[X] + E^2[X]
\end{aligned}$$

where the last step is due to linearity of the expectation operator and the fact that $E[X]$ is a constant. Hence

$$\mathrm{var}(X) = E[X^2] - E^2[X]$$

and is seen to depend on $E[X]$ and $E[X^2]$. In the case where $E[X] = 0$, we have the simple result that $\mathrm{var}(X) = E[X^2]$. This property of the variance along with some others is now summarized.

Property  6.1  —  Alternative  expression  for  variance 


$$\mathrm{var}(X) = E[X^2] - E^2[X] \tag{6.8}$$

□ 
Property 6.2 — Variance for random variable modified by a constant

For $c$ a constant

$$\begin{aligned}
\mathrm{var}(c) &= 0 \\
\mathrm{var}(X + c) &= \mathrm{var}(X) \\
\mathrm{var}(cX) &= c^2 \mathrm{var}(X).
\end{aligned}$$

□ 

The  reader  is  asked  to  verify  Property  6.2  in  Problem  6.21. 
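Before attempting the proof, Properties 6.1 and 6.2 can be spot-checked numerically. The Python sketch below is only an illustration: the three-point PMF, the constant c = 3, and the helper names are all hypothetical, and a numerical check on one PMF is of course not a proof.

```python
from fractions import Fraction


def expectation(pmf, g=lambda x: x):
    """E[g(X)] computed from (6.5)."""
    return sum(g(x) * p for x, p in pmf.items())


def variance(pmf):
    """var(X) = E[(X - E[X])^2]."""
    mean = expectation(pmf)
    return expectation(pmf, lambda x: (x - mean) ** 2)


def transform(pmf, g):
    """PMF of g(X) when g is one-to-one on the values of X."""
    return {g(x): p for x, p in pmf.items()}


# Hypothetical PMF used only for this check.
pmf_x = {1: Fraction(1, 2), 2: Fraction(1, 4), 4: Fraction(1, 4)}
c = 3

var_x = variance(pmf_x)
alt_var_x = expectation(pmf_x, lambda x: x ** 2) - expectation(pmf_x) ** 2

print(var_x, alt_var_x)                                # Property 6.1
print(variance({c: Fraction(1)}))                      # var(c)
print(variance(transform(pmf_x, lambda x: x + c)))     # var(X + c)
print(variance(transform(pmf_x, lambda x: c * x)))     # var(cX)
```

Shifting by c leaves the variance unchanged, while scaling by c multiplies it by c squared.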

The expectations $E[X]$ and $E[X^2]$ are called the first and second moments of $X$, respectively. The term moment has been borrowed from physics, where $E[X]$ is called the center of mass or moment of mass (see also Problem 6.1). In general, the $n$th moment is defined as $E[X^n]$ and exists (meaning that the value can be determined unambiguously and is finite) if $E[|X|^n]$ is finite. The latter is called the $n$th absolute moment. It can be shown that if $E[X^s]$ exists, then $E[X^r]$ exists for $r < s$ (see Problem 6.23). As a result, if $E[X^2]$ is finite, then $E[X]$ exists and by (6.8) the variance will also exist. In summary, the mean and variance of a discrete random variable will exist if the second moment is finite.

A variant of the notion of moments is that of the central moments. They are defined as $E[(X - E[X])^n]$, in which the mean is first subtracted from $X$ before the $n$th moment is computed. They are useful in assessing the average deviations from the mean. In particular, for $n = 2$ we have the usual definition of the variance. See also Problem 6.26 for the relationship between the moments and central moments.


Variance is a nonlinear operator.

The variance of a random variable does not have the linearity property of the expectation operator. Hence, in general

$$\mathrm{var}(g_1(X) + g_2(X)) = \mathrm{var}(g_1(X)) + \mathrm{var}(g_2(X))$$

is not true.

Just consider $\mathrm{var}(X + X)$, where $E[X] = 0$ as a simple example.
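The suggested counterexample can be worked out concretely. The Python sketch below assumes, purely for illustration, a zero-mean random variable taking the values -1 and +1 with equal probability; the names are hypothetical.

```python
from fractions import Fraction


def variance(pmf):
    """var(X) = E[(X - E[X])^2] computed from the PMF."""
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())


# Hypothetical zero-mean PMF: X = -1 or +1, each with probability 1/2.
pmf_x = {-1: Fraction(1, 2), 1: Fraction(1, 2)}

# X + X = 2X takes the values -2 and +2 with the same probabilities.
pmf_sum = {x + x: p for x, p in pmf_x.items()}

print(variance(pmf_sum))                     # var(X + X)
print(variance(pmf_x) + variance(pmf_x))     # var(X) + var(X)
```

Here var(X + X) = var(2X) = 4 var(X) by Property 6.2, which is twice the value that linearity would predict.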

As explained previously, an alternative interpretation of $E[X]$ is as the best predictor of $X$. Recall that this predictor is the constant $b_{\mathrm{opt}} = E[X]$ when the mean square error is used as a measure of error. We wish to point out that the minimum mse is then

$$\begin{aligned}
\mathrm{mse}_{\min} &= E[(X - b_{\mathrm{opt}})^2] \\
&= E[(X - E[X])^2] \\
&= \mathrm{var}(X).
\end{aligned} \tag{6.9}$$
Thus, how well we can predict the outcome of an experiment depends on the variance of the random variable. As an example, consider a coin toss with a probability of heads ($X = 1$) of $p$ and of tails ($X = 0$) of $1 - p$, i.e., a Bernoulli random variable. We would predict the outcome of $X$ to be $b_{\mathrm{opt}} = E[X] = p$ and the minimum mse is the variance which from Example 6.4 is $\mathrm{mse}_{\min} = p(1 - p)$. This is plotted in Figure 6.5 versus $p$. It is seen that the minimum mse is smallest when $p = 0$ or $p = 1$ and largest when $p = 1/2$, or most predictable for $p = 0$ and $p = 1$ and least predictable for $p = 1/2$. Can you explain this?


Figure  6.5:  Measure  of  predictability  of  the  outcome  of  a  coin  toss. 


6.7  Characteristic  Functions 

Determining  the  moments  E[Xn]  of  a  random  variable  can  be  a  difficult  task  for 
some  PMFs.  An  alternative  method  that  can  be  considerably  easier  is  based  on 
the  characteristic  function.  In  addition,  the  characteristic  function  can  be  used  to 
examine  convergence  of  PMFs,  as,  for  example,  in  the  convergence  of  the  binomial 
PMF  to  the  Poisson  PMF,  and  to  determine  the  PMF  for  a  sum  of  independent 
random  variables,  which  will  be  examined  in  Chapter  7.  In  this  section  we  discuss  the 
use  of  the  characteristic  function  for  the  calculation  of  moments  and  to  investigate 
the  convergence  of  a  PMF. 

The characteristic function of a random variable $X$ is defined as

$$\phi_X(\omega) = E[\exp(j\omega X)] \tag{6.10}$$

where $j$ is the square root of $-1$ and where $\omega$ takes on a suitable range of values. Note that the function $g(X) = \exp(j\omega X)$ is complex but by defining $E[g(X)] = E[\cos(\omega X) + j\sin(\omega X)] = E[\cos(\omega X)] + jE[\sin(\omega X)]$, we can apply (6.5) to the real and imaginary parts of $\phi_X(\omega)$ to yield

$$\begin{aligned}
\phi_X(\omega) &= E[\exp(j\omega X)] \\
&= E[\cos(\omega X) + j\sin(\omega X)] \\
&= E[\cos(\omega X)] + jE[\sin(\omega X)] \\
&= \sum_i \cos(\omega x_i) p_X[x_i] + j \sum_i \sin(\omega x_i) p_X[x_i] \\
&= \sum_i \exp(j\omega x_i) p_X[x_i].
\end{aligned} \tag{6.11}$$


To simplify the discussion, yet still be able to apply our results to the important PMFs, we assume that the sample space $S_X$ is a subset of the integers. Then (6.11) becomes

$$\phi_X(\omega) = \sum_{k=-\infty}^{\infty} \exp(j\omega k) p_X[k]$$

or rearranging

$$\phi_X(\omega) = \sum_{k=-\infty}^{\infty} p_X[k] \exp(j\omega k) \tag{6.12}$$

where $p_X[k] = 0$ for those integers not included in $S_X$. For example, in the Poisson PMF the range of summation in (6.12) would be $k \geq 0$. In this form, the characteristic function is immediately recognized as being the Fourier transform of the sequence $p_X[k]$ for $-\infty < k < \infty$. Its definition is slightly different than the usual Fourier transform, called the discrete-time Fourier transform, which uses the function $\exp(-j\omega k)$ in its definition [Jackson 1991]. As a Fourier transform, it exhibits all the usual properties. In particular, the Fourier transform of a sequence is periodic with period of $2\pi$ (see Property 6.4 for a proof). As a result, we need only examine the characteristic function over the interval $-\pi \leq \omega \leq \pi$, which is defined to be the fundamental period. For our purposes the most useful property is that we can differentiate the sum in (6.12) “term by term” or


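$$\frac{d\phi_X(\omega)}{d\omega} = \frac{d}{d\omega} \sum_{k=-\infty}^{\infty} p_X[k] \exp(j\omega k) = \sum_{k=-\infty}^{\infty} p_X[k] \frac{d}{d\omega} \exp(j\omega k).$$

The utility in doing so is to produce a formula for $E[X]$. Carrying out the differentiation

$$\frac{d\phi_X(\omega)}{d\omega} = \sum_{k=-\infty}^{\infty} p_X[k] \, jk \exp(j\omega k)$$

so that

$$\begin{aligned}
\frac{1}{j} \left. \frac{d\phi_X(\omega)}{d\omega} \right|_{\omega=0} &= \sum_{k=-\infty}^{\infty} k p_X[k] \\
&= E[X].
\end{aligned}$$

In fact, repeated differentiation produces the formula for the $n$th moment as

$$E[X^n] = \frac{1}{j^n} \left. \frac{d^n \phi_X(\omega)}{d\omega^n} \right|_{\omega=0}. \tag{6.13}$$

All the moments that exist may be found by repeated differentiation of the characteristic function. An example follows.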

Example  6.5  —  First  two  moments  of  geometric  random  variable 

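Since the PMF for a geometric random variable is given by $p_X[k] = (1 - p)^{k-1} p$ for $k = 1, 2, \ldots$, we have that

$$\begin{aligned}
\phi_X(\omega) &= \sum_{k=1}^{\infty} p_X[k] \exp(j\omega k) \\
&= \sum_{k=1}^{\infty} (1 - p)^{k-1} p \exp(j\omega k) \\
&= p \exp(j\omega) \sum_{k=1}^{\infty} [(1 - p) \exp(j\omega)]^{k-1}.
\end{aligned}$$

But since $|(1 - p) \exp(j\omega)| < 1$, we can use the result

$$\sum_{k=1}^{\infty} z^{k-1} = \sum_{k=0}^{\infty} z^k = \frac{1}{1 - z}$$

for $z$ a complex number with $|z| < 1$ to yield the characteristic function

$$\begin{aligned}
\phi_X(\omega) &= \frac{p \exp(j\omega)}{1 - (1 - p) \exp(j\omega)} \\
&= \frac{p}{\exp(-j\omega) - (1 - p)}.
\end{aligned} \tag{6.14}$$

Note that as claimed the characteristic function is periodic with period $2\pi$. To find the mean we use (6.13) with $n = 1$ to produce

$$\begin{aligned}
E[X] &= \frac{1}{j} \left. \frac{d\phi_X(\omega)}{d\omega} \right|_{\omega=0} \\
&= \frac{1}{j} \, p \left. \frac{j \exp(-j\omega)}{[\exp(-j\omega) - (1 - p)]^2} \right|_{\omega=0} && \text{(6.15)} \\
&= \frac{1}{j} \, p \, \frac{j}{p^2} = \frac{1}{p} && \text{(6.16)}
\end{aligned}$$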
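which agrees with our earlier results based on using the definition of expected value. To find the second moment and hence the variance using (6.8)

$$\begin{aligned}
E[X^2] &= \frac{1}{j^2} \left. \frac{d^2 \phi_X(\omega)}{d\omega^2} \right|_{\omega=0} \\
&= \frac{p}{j} \left. \frac{d}{d\omega} \frac{\exp(-j\omega)}{[\exp(-j\omega) - (1 - p)]^2} \right|_{\omega=0} && \text{(from (6.15))} \\
&= \frac{p}{j} \left. \frac{D^2 (-j) \exp(-j\omega) - \exp(-j\omega) \, 2D (-j) \exp(-j\omega)}{D^4} \right|_{\omega=0}
\end{aligned}$$

where $D = \exp(-j\omega) - (1 - p)$. Since $D|_{\omega=0} = p$, we have that

$$\begin{aligned}
E[X^2] &= \frac{p}{j} \, \frac{-jp^2 + 2jp}{p^4} \\
&= \frac{2}{p^2} - \frac{1}{p}
\end{aligned}$$

so that finally we have

$$\begin{aligned}
\mathrm{var}(X) &= E[X^2] - E^2[X] \\
&= \frac{2}{p^2} - \frac{1}{p} - \frac{1}{p^2} \\
&= \frac{1 - p}{p^2}.
\end{aligned}$$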
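The moment formula (6.13) can also be applied numerically, replacing the analytical derivatives by finite differences. The Python sketch below is an approximation under stated assumptions: central differences with step h = 0.001, the value p = 0.25, and all function names are choices made for this illustration only.

```python
import cmath


def char_fn_geometric(omega, p):
    """Characteristic function (6.14) of a geometric random variable."""
    return p / (cmath.exp(-1j * omega) - (1 - p))


def moment_from_char_fn(phi, n, h=1e-3):
    """Approximate E[X^n] = (1/j^n) d^n phi / d omega^n at omega = 0.

    Only n = 1 and n = 2 are supported, using central differences.
    """
    if n == 1:
        deriv = (phi(h) - phi(-h)) / (2 * h)
    elif n == 2:
        deriv = (phi(h) - 2 * phi(0.0) + phi(-h)) / h ** 2
    else:
        raise ValueError("only n = 1 and n = 2 are implemented")
    return (deriv / 1j ** n).real


p = 0.25


def phi(omega):
    return char_fn_geometric(omega, p)


mean = moment_from_char_fn(phi, 1)
second_moment = moment_from_char_fn(phi, 2)
var = second_moment - mean ** 2

print(mean, 1 / p)             # compare with E[X] = 1/p
print(var, (1 - p) / p ** 2)   # compare with var(X) = (1 - p)/p^2
```

The finite-difference values agree with 1/p and (1 - p)/p^2 up to the discretization error of the differences.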


As a second example, we consider the binomial PMF.

Example 6.6 - Expected value of binomial PMF

We first determine the characteristic function as

$$\begin{aligned}
\phi_X(\omega) &= \sum_{k=-\infty}^{\infty} p_X[k] \exp(j\omega k) \\
&= \sum_{k=0}^{M} \binom{M}{k} p^k (1 - p)^{M-k} \exp(j\omega k) \\
&= \sum_{k=0}^{M} \binom{M}{k} [\underbrace{p \exp(j\omega)}_{a}]^k [\underbrace{1 - p}_{b}]^{M-k} && \text{(6.17)} \\
&= (a + b)^M && \text{(binomial theorem)} \\
&= [p \exp(j\omega) + (1 - p)]^M. && \text{(6.18)}
\end{aligned}$$

The expected value then follows as

$$\begin{aligned}
E[X] &= \frac{1}{j} \left. \frac{d\phi_X(\omega)}{d\omega} \right|_{\omega=0} \\
&= \frac{1}{j} M [p \exp(j\omega) + (1 - p)]^{M-1} pj \exp(j\omega) \Big|_{\omega=0} \\
&= Mp
\end{aligned}$$

which is in agreement with our earlier results. The variance can be found by using (6.8) and (6.13) for $n = 2$. It is left as an exercise to the reader to show that (see Problem 6.29)

$$\mathrm{var}(X) = Mp(1 - p).$$

❖

The characteristic functions for the other important PMFs are given in Table 6.1.
Some  important  properties  of  the  characteristic  function  are  listed  next. 

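Property 6.3 - Characteristic function always exists since $|\phi_X(\omega)| < \infty$

Proof:

$$\begin{aligned}
|\phi_X(\omega)| &= \left| \sum_{k=-\infty}^{\infty} p_X[k] \exp(j\omega k) \right| \\
&\leq \sum_{k=-\infty}^{\infty} |p_X[k] \exp(j\omega k)| && \text{(magnitude of sum of complex numbers cannot exceed sum of magnitudes)} \\
&= \sum_{k=-\infty}^{\infty} |p_X[k]| && (|\exp(j\omega k)| = 1) \\
&= \sum_{k=-\infty}^{\infty} p_X[k] && (p_X[k] \geq 0) \\
&= 1.
\end{aligned}$$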


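Property 6.4 - Characteristic function is periodic with period $2\pi$.

Proof: For $m$ an integer

$$\begin{aligned}
\phi_X(\omega + 2\pi m) &= \sum_{k=-\infty}^{\infty} p_X[k] \exp[j(\omega + 2\pi m)k] \\
&= \sum_{k=-\infty}^{\infty} p_X[k] \exp[j\omega k] \exp[j 2\pi m k] \\
&= \sum_{k=-\infty}^{\infty} p_X[k] \exp[j\omega k] && \text{(since $\exp(j 2\pi m k) = 1$ for $mk$ an integer)} \\
&= \phi_X(\omega).
\end{aligned}$$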


□ 
Property 6.5 — The PMF may be recovered from the characteristic function.

Given the characteristic function, we may determine the PMF using

$$p_X[k] = \int_{-\pi}^{\pi} \phi_X(\omega) \exp(-j\omega k) \frac{d\omega}{2\pi} \qquad -\infty < k < \infty. \tag{6.19}$$

Proof: Since the characteristic function is the Fourier transform of a sequence (although its definition uses a $+j$ instead of the usual $-j$), it has an inverse Fourier transform. Although any interval of length $2\pi$ may be used to perform the integration in the inverse Fourier transform, it is customary to use $[-\pi, \pi]$ which results in (6.19).

□ 
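The inversion formula (6.19) can be tried out numerically by replacing the integral with a Riemann sum over one period. The Python sketch below is a hedged illustration: the binomial parameters M = 4 and p = 0.3, the 256-point grid, and the function names are assumptions of this example.

```python
import cmath
import math


def char_fn_binomial(omega, M, p):
    """Characteristic function (6.18) of a binomial random variable."""
    return (p * cmath.exp(1j * omega) + (1 - p)) ** M


def pmf_from_char_fn(phi, k, num_points=256):
    """Approximate (6.19) by a Riemann sum over [-pi, pi).

    With d_omega = 2*pi/num_points, the factor d_omega/(2*pi) becomes
    1/num_points.
    """
    total = 0j
    for n in range(num_points):
        omega = -math.pi + 2 * math.pi * n / num_points
        total += phi(omega) * cmath.exp(-1j * omega * k)
    return (total / num_points).real


M, p = 4, 0.3


def phi(omega):
    return char_fn_binomial(omega, M, p)


recovered = [pmf_from_char_fn(phi, k) for k in range(M + 1)]
exact = [math.comb(M, k) * p ** k * (1 - p) ** (M - k) for k in range(M + 1)]

for k in range(M + 1):
    print(k, recovered[k], exact[k])
```

Because the binomial characteristic function is a finite sum of complex exponentials, the equally spaced sum reproduces the PMF values to within rounding error; for other PMFs the sum is only an approximation of the integral.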


Property 6.6 — Convergence of characteristic functions guarantees convergence of PMFs.

This property says that if we have a sequence of characteristic functions, say $\phi_X^{(n)}(\omega)$, which converges to a given characteristic function, say $\phi_X(\omega)$, then the corresponding sequence of PMFs, say $p_X^{(n)}[k]$, must converge to a given PMF say $p_X[k]$, where $p_X[k]$ is given by (6.19). The importance of this theorem is that it allows us to approximate PMFs by simpler ones if we can show that the characteristic functions are approximately equal. An illustration is given next. This theorem is known as the continuity theorem of probability. Its proof is beyond the scope of this text but can be found in [Pollard 2002].

□ 

We recall the approximation of the binomial PMF by the Poisson PMF under the conditions that $p \to 0$ and $M \to \infty$ with $Mp = \lambda$ fixed (see Section 5.6). To show this using the characteristic function approach (based on Property 6.6) we let $X_b$ denote a binomial random variable. Its characteristic function is from (6.18)

$$\phi_{X_b}(\omega) = [p \exp(j\omega) + (1 - p)]^M$$

and replacing $p$ by $\lambda/M$ we have

$$\begin{aligned}
\phi_{X_b}(\omega) &= \left[ \frac{\lambda}{M} \exp(j\omega) + \left( 1 - \frac{\lambda}{M} \right) \right]^M \\
&= \left[ 1 + \frac{\lambda(\exp(j\omega) - 1)}{M} \right]^M \\
&\to \exp[\lambda(\exp(j\omega) - 1)] && \text{(see Problem 5.15, results are also valid for a complex variable)}
\end{aligned}$$
as $M \to \infty$. For a Poisson random variable $X_P$ we have that

$$\begin{aligned}
\phi_{X_P}(\omega) &= \sum_{k=0}^{\infty} \exp(-\lambda) \frac{\lambda^k}{k!} \exp(j\omega k) \\
&= \exp(-\lambda) \sum_{k=0}^{\infty} \frac{[\lambda \exp(j\omega)]^k}{k!} \\
&= \exp(-\lambda) \exp[\lambda \exp(j\omega)] && \text{(using results from Problem 5.22 which also hold for a complex variable)} \\
&= \exp[\lambda(\exp(j\omega) - 1)].
\end{aligned}$$

Since $\phi_{X_b}(\omega) \to \phi_{X_P}(\omega)$ as $M \to \infty$, by Property 6.6, we must have that $p_{X_b}[k] \to p_{X_P}[k]$ for all $k$. Hence, under the stated conditions the binomial PMF becomes the Poisson PMF as $M \to \infty$. This was previously proven by other means in Section 5.6. Our derivation here though is considerably simpler.
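The convergence of the two characteristic functions can be observed numerically. The Python sketch below is illustrative only: it assumes lambda = 2, samples omega on a 201-point grid over the fundamental period, and reports the largest gap between the binomial and Poisson characteristic functions for increasing M. All names are hypothetical.

```python
import cmath
import math


def char_fn_binomial(omega, M, p):
    """Binomial characteristic function (6.18)."""
    return (p * cmath.exp(1j * omega) + (1 - p)) ** M


def char_fn_poisson(omega, lam):
    """Poisson characteristic function exp[lam (exp(j omega) - 1)]."""
    return cmath.exp(lam * (cmath.exp(1j * omega) - 1))


def max_gap(M, lam, num_points=201):
    """Largest |phi_binomial - phi_poisson| on a grid over [-pi, pi], with p = lam/M."""
    gap = 0.0
    for n in range(num_points):
        omega = -math.pi + 2 * math.pi * n / (num_points - 1)
        diff = char_fn_binomial(omega, M, lam / M) - char_fn_poisson(omega, lam)
        gap = max(gap, abs(diff))
    return gap


gaps = {M: max_gap(M, 2.0) for M in (10, 100, 1000, 10000)}
for M, gap in gaps.items():
    print(M, gap)
```

The printed gap shrinks as M grows with Mp held fixed, which is the behavior that Property 6.6 turns into convergence of the PMFs.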


6.8  Estimating  Means  and  Variances 

As  alluded  to  earlier,  an  important  aspect  of  the  mean  and  variance  of  a  PMF  is 
that  they  are  easily  estimated  in  practice.  We  have  already  briefly  discussed  this  in 
Chapter  2  where  it  was  demonstrated  how  to  do  this  with  computer  simulated  data 
(see  Example  2.3).  We  now  continue  that  discussion  in  more  detail.  To  illustrate 
the approach we will consider the PMF shown in Figure 6.6a.

Figure 6.6: PMF and computer generated data used to illustrate estimation of mean and variance. (a) PMF, plotted versus k. (b) Simulated data, plotted versus trial number.

Since the theoretical expected value or mean is given by

$$E[X] = \sum_{k=1}^{5} k p_X[k]$$


then by the relative frequency interpretation of probability we can use the approximation

$$p_X[k] \approx \frac{N_k}{N}$$

where $N_k$ is the number of trials in which a $k$ was the outcome and $N$ is the total number of trials. As a result, we can estimate the mean by

$$\widehat{E[X]} = \sum_{k=1}^{5} k \frac{N_k}{N}.$$

The “hat” will always denote an estimated quantity. But $kN_k$ is just the sum of all the $k$ outcomes that appear in the $N$ trials and therefore $\sum_{k=1}^{5} kN_k$ is the sum of all the outcomes in the $N$ trials. Denoting the latter by $\sum_{i=1}^{N} x_i$, we have as our estimate of the mean

$$\widehat{E[X]} = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{6.20}$$

where $x_i$ is the outcome of the $i$th trial. Note that we have just reversed our line of reasoning used in the introduction to motivate the use of $E[X]$ as the definition of the expected value of a random variable. Also, we have previously seen this type of estimate in Example 2.3 where it was referred to as the sample mean. It is usually denoted by $\bar{x}$. For the data shown in Figure 6.6b we plot the sample mean in Figure 6.7a versus $N$. Note that as $N$ becomes larger, we have that $\widehat{E[X]} \to 3 = E[X]$.
The true variance of the PMF shown in Figure 6.6a is computed as

$$\begin{aligned}
\mathrm{var}(X) &= E[X^2] - E^2[X] \\
&= \sum_{k=1}^{5} k^2 p_X[k] - E^2[X]
\end{aligned}$$

which is easily shown to be $\mathrm{var}(X) = 1.2$. It is estimated as

$$\widehat{\mathrm{var}(X)} = \widehat{E[X^2]} - (\widehat{E[X]})^2$$

and by the same rationale as before we use

$$\widehat{E[X^2]} = \frac{1}{N} \sum_{i=1}^{N} x_i^2$$

so that our estimate of the variance becomes

$$\widehat{\mathrm{var}(X)} = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)^2. \tag{6.21}$$

Figure 6.7: Estimated mean and variance for computer data shown in Figure 6.6. (a) Estimated mean. (b) Estimated variance.

This estimate is shown in Figure 6.7b as a function of $N$. Note that as the number of trials increases the estimate of variance converges to the true value of $\mathrm{var}(X) = 1.2$.
The  MATLAB  code  used  to  generate  the  data  and  estimate  the  mean  and  variance 
is  given  in  Appendix  6B.  Also,  in  that  appendix  is  listed  the  MATLAB  subprogram 
PMFdata.m  which  allows  easier  generation  of  the  outcomes  of  a  discrete  random 
variable.  In  practice,  it  is  customary  to  use  (6.20)  and  (6.21)  to  analyze  real-world 
data  as  a  first  step  in  assessing  the  characteristics  of  an  unknown  PMF. 
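The estimators (6.20) and (6.21) are simple to try on simulated data. The Python sketch below is not the MATLAB code of Appendix 6B; it is an independent illustration. The PMF values used here are an assumption: they were chosen only because they give E[X] = 3 and var(X) = 1.2 on k = 1, ..., 5, and may differ from the PMF of Figure 6.6a. The function names and the seed are likewise hypothetical.

```python
import random


def sample_pmf(values, probs, num_trials, seed=0):
    """Draw num_trials outcomes of a discrete random variable with the given PMF."""
    rng = random.Random(seed)
    return rng.choices(values, weights=probs, k=num_trials)


def estimate_mean(outcomes):
    """Sample mean, as in (6.20)."""
    return sum(outcomes) / len(outcomes)


def estimate_variance(outcomes):
    """Estimated E[X^2] minus squared estimated mean, as in (6.21)."""
    n = len(outcomes)
    return sum(x * x for x in outcomes) / n - (sum(outcomes) / n) ** 2


# Assumed PMF with mean 3 and variance 1.2 (illustrative values only).
values = [1, 2, 3, 4, 5]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]

outcomes = sample_pmf(values, probs, num_trials=100_000)
print(estimate_mean(outcomes), estimate_variance(outcomes))
```

With a large number of trials both estimates settle close to the theoretical values, in the same way as the curves described for Figure 6.7.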


6.9  Real-World  Example  —  Data  Compression 

The  digital  revolution  of  the  past  20  years  has  made  it  commonplace  to  record  and 
store  information  in  a  digital  format.  Such  information  consists  of  speech  data  in 
telephone  transmission,  music  data  stored  on  compact  discs,  video  data  stored  on 
digital  video  discs,  and  facsimile  data,  to  name  but  a  few.  The  amount  of  data 
can  become  quite  large  so  that  it  is  important  to  be  able  to  reduce  the  amount  of 
storage  required.  The  process  of  storage  reduction  is  called  data  compression.  We 
now  illustrate  how  this  is  done.  To  do  so  we  simplify  the  discussion  by  assuming 
that  the  data  consists  of  a  sequence  of  the  letters  A,  B,  C,  D.  One  could  envision 
these  letters  as  representing  the  chords  of  a  rudimentary  musical  instrument,  for 
example. The extension to the entire English alphabet consisting of 26 letters will
be  apparent.  Consider  a  typical  sequence  of  50  letters 


AAAAAAAAAAABAAAAAAAAAAAAA
AAAAAACABADAABAAABAAAAAAD.  (6.22)


To  encode  these  letters  for  storage  we  could  use  the  two-bit  code 

A → 00
B → 01
C → 10
D → 11  (6.23)

which  would  then  require  a  storage  of  2  bits  per  letter  for  a  total  storage  of  100 
bits.  However,  as  seen  above  the  typical  sequence  is  characterized  by  a  much  larger 
probability  of  observing  an  “A”  as  opposed  to  the  other  letters.  In  fact,  there  are 
43  A’s,  4  B’s,  1  C,  and  2  D’s.  It  makes  sense  then  to  attempt  a  reduction  in  storage 
by assigning shorter code words to the letters that occur more often, in this case, to the “A”. As a possible strategy, consider the code assignment


A → 0
B → 10
C → 110
D → 111.  (6.24)


Using this code assignment for our typical sequence would require only 1 · 43 + 2 · 4 + 3 · 1 + 3 · 2 = 60 bits or 1.2 bits per letter. The code given by (6.24) is called a Huffman code. It can be shown to produce fewer bits per letter “on the average” [Cover and Thomas 1991].
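The bit counts quoted above can be reproduced by actually encoding the 50-letter sequence. The Python sketch below is a minimal illustration; the function names and the simple bit-by-bit decoder are assumptions of this example. The decoder relies on the fact that no code word in (6.23) or (6.24) is the beginning of another code word.

```python
SEQUENCE = (
    "AAAAAAAAAAABAAAAAAAAAAAAA"
    "AAAAAACABADAABAAABAAAAAAD"
)

FIXED_CODE = {"A": "00", "B": "01", "C": "10", "D": "11"}       # code (6.23)
VARIABLE_CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}   # code (6.24)


def encode(letters, code):
    """Concatenate the code words of the letters into one bit string."""
    return "".join(code[letter] for letter in letters)


def decode(bits, code):
    """Recover the letters by reading bits until a complete code word is seen."""
    inverse = {word: letter for letter, word in code.items()}
    letters = []
    word = ""
    for bit in bits:
        word += bit
        if word in inverse:
            letters.append(inverse[word])
            word = ""
    return "".join(letters)


fixed_bits = encode(SEQUENCE, FIXED_CODE)
variable_bits = encode(SEQUENCE, VARIABLE_CODE)

print(len(fixed_bits), len(variable_bits))
print(len(variable_bits) / len(SEQUENCE), "bits per letter")
```

Decoding the variable-length bit string returns the original sequence, so the shorter representation loses no information.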

To  determine  actual  storage  savings  we  need  to  determine  the  average  length  of 
the  code  word  per  letter.  First  we  define  a  discrete  random  variable  that  measures 
the  length  of  the  code  word.  For  the  sample  space  S  =  {A,  B,  C,  D}  we  define  the 
random  variable 


$$X(s_i) = \begin{cases} 1 & s_1 = \mathrm{A} \\ 2 & s_2 = \mathrm{B} \\ 3 & s_3 = \mathrm{C} \\ 3 & s_4 = \mathrm{D} \end{cases}$$


which  yields  the  code  length  for  each  letter.  The  probabilities  used  to  generate  the 
sequence  of  letters  shown  in  (6.22)  are  P[A]  =  7/8,  P[B]  =  1/16,  P[C]  =  1/32, 
P[D]  =  1/32.  As  a  result  the  PMF  for  X  is 


$$p_X[k] = \begin{cases} \frac{7}{8} & k = 1 \\ \frac{1}{16} & k = 2 \\ \frac{1}{16} & k = 3. \end{cases}$$
The average code length is given by

$$\begin{aligned}
E[X] &= \sum_{k=1}^{3} k p_X[k] \\
&= 1 \cdot \frac{7}{8} + 2 \cdot \frac{1}{16} + 3 \cdot \frac{1}{16} \\
&= 1.1875 \text{ bits per letter.}
\end{aligned}$$

This results in a compression ratio of 2 : 1.1875 = 1.68 or we require about 40% less storage.

It  is  also  of  interest  to  note  that  the  average  code  word  length  per  letter  can  be 
reduced  even  further.  However,  it  requires  more  complexity  in  coding  (and  of  course 
in  decoding).  A  fundamental  theorem  due  to  Shannon,  who  in  many  ways  laid  the 
groundwork  for  the  digital  revolution,  says  that  the  average  code  word  length  per 
letter  can  be  no  less  than  [Shannon  1948] 


$$H = \sum_{i=1}^{4} P[s_i] \log_2 \frac{1}{P[s_i]} \quad \text{bits per letter.} \tag{6.25}$$

This quantity is termed the entropy of the source. In addition, he showed that a code exists that can attain, to within any small deviation, this minimum average code length. For our example, the entropy is

$$\begin{aligned}
H &= \frac{7}{8} \log_2 \frac{1}{7/8} + \frac{1}{16} \log_2 \frac{1}{1/16} + \frac{1}{32} \log_2 \frac{1}{1/32} + \frac{1}{32} \log_2 \frac{1}{1/32} \\
&= 0.7311 \text{ bits per letter.}
\end{aligned}$$

Hence, the potential compression ratio is 2 : 0.7311 = 2.73 for about a 63% reduction.
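The average code length and the entropy bound can be computed side by side. The Python sketch below uses the letter probabilities and code lengths given in the text; the function names are hypothetical.

```python
import math
from fractions import Fraction

# Letter probabilities and code-word lengths of code (6.24), as given in the text.
letter_probs = {
    "A": Fraction(7, 8),
    "B": Fraction(1, 16),
    "C": Fraction(1, 32),
    "D": Fraction(1, 32),
}
code_lengths = {"A": 1, "B": 2, "C": 3, "D": 3}


def average_code_length(probs, lengths):
    """E[X]: expected number of bits per letter for the given code."""
    return sum(probs[s] * lengths[s] for s in probs)


def entropy_bits(probs):
    """Entropy (6.25) in bits per letter."""
    return sum(float(p) * math.log2(1 / float(p)) for p in probs.values() if p > 0)


avg_length = average_code_length(letter_probs, code_lengths)
entropy = entropy_bits(letter_probs)
uniform_entropy = entropy_bits({s: Fraction(1, 4) for s in "ABCD"})

print(float(avg_length), entropy, uniform_entropy)
```

The average length of the code (6.24) lies between the entropy and the 2 bits per letter of the fixed-length code, and the entropy of four equally likely letters is exactly 2 bits.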

Clearly, it is seen from this example that the amount of reduction will depend critically upon the probabilities of the letters occurring. If they are all equally likely to occur, then the minimum average code length is from (6.25) with $P[s_i] = 1/4$

$$H = 4 \left( \frac{1}{4} \log_2 \frac{1}{1/4} \right) = 2 \text{ bits per letter.}$$

In this case no compression is possible and the original code given by (6.23) will be optimal. The interested reader should consult [Cover and Thomas 1991] for further details.
Problems


6.1  (w)  The  center  of  mass  of  a  system  of  masses  situated  on  a  line  is  the  point  at 
which  the  system  is  balanced.  That  is  to  say  that  at  this  point  the  sum  of 
the  moments,  where  the  moment  is  the  distance  from  center  of  mass  times  the 
mass,  is  zero.  If  the  center  of  mass  is  denoted  by  CM,  then 

$$\sum_{i=1}^{M} (x_i - CM) m_i = 0$$

where $x_i$ is the position of the $i$th mass along the $x$ direction and $m_i$ is its corresponding mass. First solve for CM. Then, for the system of weights
shown  in  Figure  6.8  determine  the  center  of  mass.  How  is  this  analogous  to 
the  expected  value  of  a  discrete  random  variable? 


Figure 6.8: Weightless bar supporting four weights (four 10 kg masses along the x axis, in meters; axis labels 0, 1, 5, 10, 20).


6.2 (☺) (f) For the discrete random variable with PMF

$$p_X[k] = \frac{1}{10} \qquad k = 0, 1, \ldots, 9$$

find the expected value of $X$.

6.3  (w)  A  die  is  tossed.  The  probability  of  obtaining  a  1,  2,  or  3  is  the  same.  Also, 
the  probability  of  obtaining  a  4,  5,  or  6  is  the  same.  However,  a  5  is  twice  as 
likely  to  be  observed  as  a  1.  For  a  large  number  of  tosses  what  is  the  average 
value  observed? 
6.4 (☺) (f) A coin is tossed with the probability of heads being 2/3. A head is mapped into $X = 1$ and a tail into $X = 0$. What is the expected outcome of this experiment?

6.5 (f) Determine the expected value of a Poisson random variable. Hint: Differentiate with respect to $\lambda$.

6.6 (t) Consider the PMF $p_X[k] = (2/\pi)/k^2$ for $k = \ldots, -1, 0, 1, \ldots$. The expected value is defined as

$$E[X] = \sum_{k=-\infty}^{\infty} k p_X[k]$$

which is actually shorthand for

$$E[X] = \lim_{\substack{N_L \to \infty \\ N_U \to \infty}} \sum_{k=-N_L}^{N_U} k p_X[k]$$

where the $L$ and $U$ represent “lower” and “upper”, respectively. This may be written as

$$E[X] = \lim_{N_L \to \infty} \sum_{k=-N_L}^{-1} k p_X[k] + \lim_{N_U \to \infty} \sum_{k=1}^{N_U} k p_X[k]$$

where the limits are taken independently of each other. For $E[X]$ to be unambiguous and finite both limits must be finite. As a result, show that the expected value for the given PMF does not exist. If, however, we were to constrain $N_L = N_U$, show that the expected value is zero. Note that if $N_L = N_U$, we are reordering the terms before performing the sum since the partial sums become $\sum_{k=-1}^{1} k p_X[k]$, $\sum_{k=-2}^{2} k p_X[k]$, etc. But for the expected value to be unambiguous, the value should not depend on the ordering. If a sum is absolutely summable, any ordering will produce the same result [Gaughan 1975], hence our requirement for the existence of the expected value.

6.7 (t) Assume that a discrete random variable takes on the values $k = \ldots, -1, 0, 1, \ldots$ and that its PMF satisfies $p_X[m + i] = p_X[m - i]$, where $m$ is a fixed integer and $i = 1, 2, \ldots$. This says that the PMF is symmetric about the point $x = m$. Prove that the expected value of the random variable is $E[X] = m$.

6.8 (☺) Give an example where the expected value of a random variable is not

its  most  probable  value. 

6.9  (t)  Give  an  example  of  two  PMFs  that  have  the  same  expected  value. 

6.10 (f) A discrete random variable $X$ has the PMF $p_X[k] = 1/5$ for $k = 0, 1, 2, 3, 4$. If $Y = \sin[(\pi/2)X]$, find $E[Y]$ using (6.4) and (6.5). Which way is easier?
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6-11  (t)  Prove  the  linearity  property  of  the  expectation  operator 

E[al9 i(X)  +  a2g2{X)\  =  a^g^X)}  +  a2E[g2(X )] 
where  a\  and  a 2  are  constants. 

6.12 (v^,) (f) Determine $E[X^2]$ for a geom($p$) random variable using (6.5). Hint: You will need to differentiate twice.

6.13 (o) (t) Can $E[X^2]$ ever be equal to $E^2[X]$? If so, when?

6.14  (o)  (w)  A  discrete  random  variable  X  has  the  PMF 

r  \  *  =  1 

8  k  —  2 
8  k  =  3 

A  k  =  4- 

If  the  experiment  that  produces  a  value  of  X  is  conducted,  find  the  minimum 
mean  square  error  predictor  of  the  outcome.  What  is  the  minimum  mean 
square  error  of  the  predictor? 

6.15  (^)  (c)  For  Problem  6.14  use  a  computer  to  simulate  the  experiment  for 
many  trials.  Compare  the  estimate  to  the  actual  outcomes  of  the  computer 
experiment.  Also,  compute  the  minimum  mean  square  error  and  compare  it 
to  the  theoretical  value  obtained  in  Problem  6.14. 

6.16 (w) Of the three PMFs shown in Figure 6.9, which one has the smallest variance? Hint: You do not need to actually calculate the variances.

Figure 6.9: PMFs for Problem 6.16, shown in panels (a), (b), and (c).


6.17 (w) If $Y = aX + b$, what is the variance of $Y$ in terms of the variance of $X$?

6.18 (f) Find the variance of a Poisson random variable. See the hint for Problem 6.12.

6.19  (f)  For  the  PMF  given  in  Problem  6.2  find  the  variance. 

6.20  (^)  (f)  Find  the  second  moment  for  a  Poisson  random  variable  by  using  the 
characteristic  function,  which  is  given  in  Table  6.1. 

6.21  (t)  If  X  is  a  discrete  random  variable  and  c  is  a  constant,  prove  the  following 
properties  of  the  variance: 

$\mathrm{var}(c) = 0$
$\mathrm{var}(X + c) = \mathrm{var}(X)$
$\mathrm{var}(cX) = c^2\,\mathrm{var}(X)$.

6.22  (t)  If  a  discrete  random  variable  X  has  var(X)  =  0,  prove  that  X  must  be 
a  constant  c.  This  provides  a  converse  to  the  property  that  if  X  =  c,  then 
var(X)  =  0. 

6.23 (t) In this problem we prove that if $E[X^s]$ exists, meaning that $E[|X|^s] < \infty$, then $E[X^r]$ also exists for $0 < r < s$. Provide the explanations for the following steps:

a. For $|x| \le 1$, $|x|^r \le 1$

b. For $|x| > 1$, $|x|^r \le |x|^s$

c. For all $|x|$, $|x|^r \le |x|^s + 1$

d. $E[|X|^r] = \sum_i |x_i|^r p_X[x_i] \le \sum_i (|x_i|^s + 1)\,p_X[x_i] = E[|X|^s] + 1 < \infty$.

6.24 (f) If a discrete random variable has the PMF $p_X[k] = 1/4$ for $k = -1$ and $p_X[k] = 3/4$ for $k = 1$, find the mean and variance.

6.25 (t) A symmetric PMF satisfies the relationship $p_X[-k] = p_X[k]$ for $k = \ldots, -1, 0, 1, \ldots$. Prove that all the odd order moments, $E[X^n]$ for $n$ odd, are zero.

6.26  (0)  (t)  A  central  moment  of  a  discrete  random  variable  is  defined  as 
E[(X  —  E[X])n],  for  n  a  positive  integer.  Derive  a  formula  that  relates  the 
central  moment  to  the  usual  moments.  Hint:  You  will  need  the  binomial 
formula. 

6.27 (^ ) (t) If $Y = aX + b$, find the characteristic function of $Y$ in terms of that for $X$. Next use your result to prove that $E[Y] = aE[X] + b$.

6.28 (o) (f) Find the characteristic function for the PMF $p_X[k] = 1/5$ for $k = -2, -1, 0, 1, 2$.

6.29 (f) Determine the variance of a binomial random variable by using the properties of the characteristic function. You can assume knowledge of the characteristic function for a binomial random variable.

6.30  (f)  Determine  the  mean  and  variance  of  a  Poisson  random  variable  by  using 
the  properties  of  the  characteristic  function.  You  can  assume  knowledge  of 
the  characteristic  function  for  a  Poisson  random  variable. 

6.31 (f) Which PMF $p_X[k]$ for $k = \ldots, -1, 0, 1, \ldots$ has the characteristic function $\phi_X(\omega) = \cos\omega$?

6.32  (0)(c)  For  the  random  variable  described  in  Problem  6.24  perform  a  com¬ 
puter  simulation  to  estimate  its  mean  and  variance.  How  does  it  compare  to 
the  true  mean  and  variance? 


Appendix  6A 


Derivation of E[g(X)] Formula


Assume that $X$ is a discrete random variable taking on values in $S_X = \{x_1, x_2, \ldots\}$ with PMF $p_X[x_i]$. Then, if $Y = g(X)$ we have from the definition of expected value

$$E[Y] = \sum_i y_i\,p_Y[y_i] \qquad (6A.1)$$

where the sum is over all $y_i \in S_Y$. Note that it is assumed that the $y_i$ are distinct (all different). But from (5.9)

$$p_Y[y_i] = \sum_{\{x_j:\,g(x_j) = y_i\}} p_X[x_j]. \qquad (6A.2)$$

To  simplify  the  notation  we  will  define  the  indicator  function ,  which  indicates 
whether  a  number  x  is  within  a  given  set  A,  as 


$$I_A(x) = \begin{cases} 1 & x \in A \\ 0 & \text{otherwise.} \end{cases}$$


Then  (6A.2)  can  be  rewritten  as 

$$p_Y[y_i] = \sum_{j=1}^{\infty} p_X[x_j]\,I_{\{0\}}(y_i - g(x_j))$$

since the sum will include the term $p_X[x_j]$ only if $y_i - g(x_j) = 0$. Using this, we have from (6A.1)

$$\begin{aligned}
E[Y] &= \sum_i y_i \sum_{j=1}^{\infty} p_X[x_j]\,I_{\{0\}}(y_i - g(x_j)) \\
     &= \sum_{j=1}^{\infty} \left[ \sum_i y_i\,I_{\{0\}}(y_i - g(x_j)) \right] p_X[x_j].
\end{aligned}$$

Now for a given $j$, $g(x_j)$ is a fixed number and since the $y_i$'s are distinct, there is only one $y_i$ for which $y_i = g(x_j)$. Thus, we have that

$$\sum_i y_i\,I_{\{0\}}(y_i - g(x_j)) = g(x_j)$$

and finally

$$E[Y] = E[g(X)] = \sum_{j=1}^{\infty} g(x_j)\,p_X[x_j].$$
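
As a numerical check of this derivation, the following MATLAB sketch computes $E[g(X)]$ both ways for the PMF of Problem 6.10: directly from $p_X$, and by first forming $p_Y$ as in (6A.2). The variable names are chosen here for illustration, and rounding the values of $g$ is a convenience that works only because $\sin[(\pi/2)k]$ is exactly $-1$, $0$, or $1$ for integer $k$.

```matlab
% Check E[g(X)] computed two ways for the PMF of Problem 6.10:
% pX[k] = 1/5 for k = 0,1,2,3,4 and Y = g(X) = sin((pi/2)X).
xj = (0:4)';                    % values of X
pX = ones(5,1)/5;               % PMF of X
g  = @(x) round(sin(pi*x/2));   % exact values are -1, 0, 1; round removes
                                % floating-point error (assumption of this sketch)

% Method 1: E[g(X)] = sum_j g(x_j) pX[x_j]
EY_direct = sum(g(xj).*pX);

% Method 2: form pY[y_i] via (6A.2), then apply (6A.1)
yi = unique(g(xj));             % distinct values of Y
pY = zeros(size(yi));
for i = 1:length(yi)
    pY(i) = sum(pX(g(xj) == yi(i)));   % sum pX over {x_j : g(x_j) = y_i}
end
EY_viaPY = sum(yi.*pY);
```

Both `EY_direct` and `EY_viaPY` should agree; the first route avoids having to determine $p_Y$ at all.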


Appendix  6B 

MATLAB  Code  Used  to 
Estimate  Mean  and  Variance 


Figures  6.6  and  6.7  are  based  on  the  following  MATLAB  code. 

% PMFdata.m
%
% This program generates the outcomes for N trials
% of an experiment for a discrete random variable.
% Uses the method of Section 5.9.
% It is a function subprogram.
%
% Input parameters:
%
%   N  - number of trials desired
%   xi - values of x_i's of discrete random variable (M x 1 vector)
%   pX - PMF of discrete random variable (M x 1 vector)
%
% Output parameters:
%
%   x  - outcomes of N trials (N x 1 vector)
%
function x=PMFdata(N,xi,pX)
M=length(xi);M2=length(pX);
if M~=M2
   error('xi and pX must have the same dimension')
end
for k=1:M   % see Section 5.9 and Figure 5.14 for approach used here
   if k==1
      bin(k,1)=pX(k);   % set up first interval of CDF as [0,pX(1)]
   else
      bin(k,1)=bin(k-1,1)+pX(k);   % set up succeeding intervals of CDF
   end
end
u=rand(N,1);   % generate N outcomes of uniform random variable
for i=1:N      % determine which interval of CDF the outcome lies in
               % and map into value of xi
   if u(i)>0&u(i)<=bin(1)
      x(i,1)=xi(1);
   end
   for k=2:M
      if u(i)>bin(k-1)&u(i)<=bin(k)
         x(i,1)=xi(k);
      end
   end
end
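
The same CDF-interval idea can be used to estimate a mean and variance, as asked in Problem 6.32. The sketch below is self-contained rather than a call to PMFdata.m; the number of trials and the variable names are illustrative choices, not values taken from the text.

```matlab
% Sketch: estimate the mean and variance of the PMF of Problem 6.24
% (pX[-1] = 1/4, pX[1] = 3/4) by mapping uniform outcomes through the CDF.
xi = [-1; 1];            % values of X
pX = [1/4; 3/4];         % PMF of X
N  = 10000;              % number of trials (illustrative choice)

edges = cumsum(pX);      % upper edges of the CDF intervals
edges(end) = 1;          % guard against rounding in the last edge
u = rand(N,1);           % uniform outcomes on (0,1)
x = zeros(N,1);
for i = 1:N
    k = find(u(i) <= edges, 1);   % first interval whose upper edge covers u(i)
    x(i) = xi(k);
end

meanEst  = mean(x);                      % estimate of E[X]
varEst   = mean(x.^2) - meanEst^2;       % estimate of var(X)
trueMean = sum(xi.*pX);                  % E[X] from the PMF
trueVar  = sum(xi.^2.*pX) - trueMean^2;  % var(X) from the PMF
```

With enough trials the estimates should be close to `trueMean` and `trueVar`, though any single run will differ slightly because the outcomes are random.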


Chapter  7 


Multiple  Discrete  Random 
Variables 

7.1  Introduction 

In Chapter 5 we introduced the concept of a discrete random variable as a mapping from the sample space $S = \{s_i\}$ to a countable set of real numbers (either finite or countably infinite) via a mapping $X(s_i)$. In effect, the mapping yields useful
numerical  information  about  the  outcome  of  the  random  phenomenon.  In  some 
instances,  however,  we  would  like  to  measure  more  than  just  one  attribute  of  the 
outcome.  For  example,  consider  the  choice  of  a  student  at  random  from  a  population 
of  college  students.  Then,  for  the  purpose  of  assessing  the  student’s  health  we  might 
wish  to  know  his/her  height,  weight,  blood  pressure,  pulse  rate,  etc.  All  these 
measurements  and  others  are  used  by  a  physician  for  a  disease  risk  assessment. 
Hence, the mapping from the sample space of college students to the important measurements of height and weight, for example, would be $H(s_i) = h_i$ and $W(s_i) = w_i$, where $H$ and $W$ represent the height and weight of the student selected. In Table 4.1 we summarized a hypothetical set of probabilities for heights and weights. The table is a two-dimensional array that lists the probabilities $P[H = h_i \text{ and } W = w_j]$.
This  information  can  also  be  displayed  in  a  three-dimensional  format  as  shown  in 
Figure  7.1,  where  we  have  associated  the  center  point  of  each  interval  of  height  and 
weight  given  in  Table  4.1  with  the  probability  displayed.  These  probabilities  were 
termed  joint  probabilities.  In  this  chapter  we  discuss  the  case  of  multiple  random 
variables. For example, the height and weight could be represented as a 2 x 1 random vector

$$\begin{bmatrix} H \\ W \end{bmatrix}$$

and as such, its value is located in the plane (also called $R^2$). We will initially describe the simplest case of two random variables but all concepts are easily extended to any finite number of random variables (see Chapter 9 for this extension).

Figure 7.1: Joint probabilities for heights and weights of college students.

As  we  will  see  throughout  our  discussions,  the  new  and  very  important  concept 
will  be  the  dependencies  between  the  multiple  random  variables.  Questions  such 
as  “Can  we  predict  a  person’s  height  from  his  weight?”  naturally  arise  and  can  be 
addressed  once  we  extend  our  description  of  a  single  random  variable  to  multiple 
random  variables. 


7.2  Summary 

The concept of jointly distributed discrete random variables is illustrated in Figure 7.2. Two random variables can be thought of as a random vector and assigned a joint PMF $p_{X,Y}[x_i, y_j]$ as described in Section 7.3, which has Properties 7.1 and 7.2. The joint PMF may be obtained by using (7.2) if the probabilities on the original experimental sample space are known, as illustrated in Example 7.1. Once the joint PMF is specified, the probability of any event concerning the random variables is determined via (7.3). The marginal PMFs of the two random variables, which are the probabilities of each random variable taking on its possible values, are obtained from the joint PMF using (7.5) and (7.6). However, the joint PMF is not uniquely determined from the marginal PMFs. The joint CDF is defined by (7.7) and evaluated using (7.8). It has the usual properties as summarized via Properties 7.3-7.6. Random variables are defined to be independent if the probabilities of all the joint events can be found as the product of the probabilities of the single events. If the random variables are independent, then the joint PMF factors as in (7.11). Given a joint PMF, independence can be established by determining if the PMF factors. Conversely, if we know the random variables are independent, and we are given the marginal PMFs, then the joint PMF is found as the product of the marginals. The joint PMF of a transformed vector random variable is given by
(7.12)  and  illustrated  in  Example  7.6.  The  PMF  for  the  sum  of  two  independent 
discrete  random  variables  can  be  found  using  (7.22)  or  via  characteristic  functions 
using  (7.24).  The  expected  value  of  a  function  of  two  random  variables  is  found 
from  (7.28).  Also,  the  variance  of  the  sum  of  two  random  variables  is  given  by 
(7.33)  and  involves  the  covariance,  which  is  defined  by  (7.34).  The  interpretation  of 
the  covariance  is  given  in  Section  7.8  and  is  seen  to  provide  a  quantification  of  the 
knowledge  of  the  outcome  of  one  random  variable  on  the  probability  of  the  other. 
Independent  random  variables  have  a  covariance  of  zero,  but  the  converse  is  not 
true.  In  Section  7.9  linear  prediction  of  one  random  variable  based  on  observation 
of  another  random  variable  is  explored.  The  optimal  linear  predictor  is  given  by 
(7.41).  A  variation  of  this  prediction  equation  results  in  the  important  parameter 
called  the  correlation  coefficient  (7.43).  It  quantifies  the  relationship  of  one  random 
variable  with  another.  However,  a  nonzero  correlation  does  not  indicate  a  causal 
relationship.  The  joint  characteristic  function  is  introduced  in  Section  7.10  and 
is  defined  by  (7.45)  and  evaluated  by  (7.46).  It  is  shown  to  provide  a  convenient 
means  of  determining  the  PMF  for  a  sum  of  independent  random  variables.  In 
Section  7.11  a  method  to  simulate  a  random  vector  is  described.  Also,  methods  to 
estimate  joint  PMFs,  marginal  PMFs,  and  other  quantities  of  interest  are  given. 
Finally,  in  Section  7.12  an  application  of  the  methods  of  the  chapter  to  disease  risk 
assessment  is  described. 

7.3  Jointly  Distributed  Random  Variables 

We consider two discrete random variables that will be denoted by $X$ and $Y$. As alluded to in the introduction, they represent the functions that map an outcome $s_i$ of an experiment to a value in the plane. Hence, we have the mapping

$$\begin{bmatrix} X(s_i) \\ Y(s_i) \end{bmatrix} = \begin{bmatrix} x_i \\ y_i \end{bmatrix}$$

for all $s_i \in S$. An example is shown in Figure 7.2 in which the experiment consists of the simultaneous tossing of a penny and a nickel. The outcome in the sample space $S$ is represented by TH, for example, if the penny comes up tails and the nickel comes up heads. Explicitly, the mapping is

$$\begin{bmatrix} X(s_i) \\ Y(s_i) \end{bmatrix} = \begin{cases} [0\ 0]^T & \text{if } s_i = \text{TT} \\ [0\ 1]^T & \text{if } s_i = \text{TH} \\ [1\ 0]^T & \text{if } s_i = \text{HT} \\ [1\ 1]^T & \text{if } s_i = \text{HH.} \end{cases}$$

Figure 7.2: Example of mapping for jointly distributed discrete random variables.


Two  random  variables  that  are  defined  on  the  same  sample  space  S  are  said  to  be 
jointly  distributed.  In  this  example,  the  random  variables  are  also  discrete  random 
variables  in  that  the  possible  values  (which  are  actually  2x1  vectors)  are  countable. 
In this case there are just four vector values. These values comprise the sample space which is the subset of the plane given by

$$S_{X,Y} = \{(0,0), (0,1), (1,0), (1,1)\}.$$

We  can  also  refer  to  the  two  random  variables  as  the  single  random  vector  [X  Y]T, 
where  T  denotes  the  vector  transpose.  Hence,  we  will  use  the  terms  multiple  random 
variables  and  random  vector  interchangeably.  The  values  of  the  random  vector  will 
be denoted either by $(x, y)$, which is an ordered pair or a point in the plane, or by $[x\ y]^T$, which denotes a two-dimensional vector. These notations will be synonymous.

The size of the sample space for discrete random variables can be finite or countably infinite. In the example of Figure 7.2, since $X$ can take on 2 values, denoted by $N_X = 2$, and $Y$ can take on 2 values, denoted by $N_Y = 2$, the total number of elements in $S_{X,Y}$ is $N_X N_Y = 4$. More generally, if $X$ can take on values in $S_X = \{x_1, x_2, \ldots, x_{N_X}\}$ and $Y$ can take on values in $S_Y = \{y_1, y_2, \ldots, y_{N_Y}\}$, then the random vector can take on values in

$$S_{X,Y} = S_X \times S_Y = \{(x_i, y_j) : i = 1, 2, \ldots, N_X;\ j = 1, 2, \ldots, N_Y\}$$

for a total of $N_{X,Y} = N_X N_Y$ values. This is shown in Figure 7.3 for the case of $N_X = 4$ and $N_Y = 3$. The notation $A \times B$, where $A$ and $B$ are sets, denotes a cartesian product set. It consists of all ordered pairs $(a_i, b_j)$ where $a_i \in A$ and $b_j \in B$. If either $S_X$ or $S_Y$ is countably infinite, then the random vector will also have a countably infinite set of values.


Figure  7.3:  Example  of  sample  space  for  jointly  distributed  discrete  random  vari¬ 
ables. 


Just as we defined the PMF for a single discrete random variable in Chapter 5 as $p_X[x_i] = P[X(s) = x_i]$, we can define the joint PMF (sometimes called the bivariate PMF) as

$$p_{X,Y}[x_i, y_j] = P[X(s) = x_i, Y(s) = y_j] \qquad i = 1, 2, \ldots, N_X;\ j = 1, 2, \ldots, N_Y.$$

Note that the set of all outcomes $s$ for which $X(s) = x_i, Y(s) = y_j$ is the same as the set of outcomes for which

$$\begin{bmatrix} X(s) \\ Y(s) \end{bmatrix} = \begin{bmatrix} x_i \\ y_j \end{bmatrix}$$

so that for the random vector to have the value $[x_i\ y_j]^T$, both $X(s) = x_i$ and $Y(s) = y_j$ must be satisfied. Thus, the comma used in the statement $X(s) = x_i, Y(s) = y_j$ is to be read as "and". An example of the joint PMF for students' heights and weights is given in Figure 7.1 in which we set $X$ = height and $Y$ = weight and the vertical axis represents $p_{X,Y}[x_i, y_j]$. To verify that a set of probabilities as in Figure 7.1 can be viewed as a joint PMF we need only verify the usual properties of probability. Assuming $N_X$ and $N_Y$ are finite, these are:

Property 7.1 - Range of values of joint PMF

$$0 \le p_{X,Y}[x_i, y_j] \le 1 \qquad i = 1, 2, \ldots, N_X;\ j = 1, 2, \ldots, N_Y.$$

□

Property 7.2 - Sum of values of joint PMF

$$\sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} p_{X,Y}[x_i, y_j] = 1$$

□

and similarly for a countably infinite sample space. For the coin toss example of Figure 7.2 we require that

$$0 \le p_{X,Y}[0,0] \le 1$$
$$0 \le p_{X,Y}[0,1] \le 1$$
$$0 \le p_{X,Y}[1,0] \le 1$$
$$0 \le p_{X,Y}[1,1] \le 1$$

and

$$\sum_{i=0}^{1} \sum_{j=0}^{1} p_{X,Y}[i,j] = 1.$$


Many possibilities exist. For two fair coins that do not interact as they are tossed (i.e., they are independent) we might assign $p_{X,Y}[i,j] = 1/4$ for all $i$ and $j$. For two coins that are weighted but again do not interact with each other as they are tossed, we might assign

$$p_{X,Y}[i,j] = \begin{cases} (1-p)^2 & i = 0, j = 0 \\ (1-p)p & i = 0, j = 1 \\ p(1-p) & i = 1, j = 0 \\ p^2 & i = 1, j = 1 \end{cases}$$

if each coin has a probability of heads of $p$. It is easily shown that the joint PMF satisfies Properties 7.1 and 7.2 for any $0 < p < 1$. In obtaining these values for the joint PMF we have used the concept of equivalent events, which allows us to determine probabilities for events defined on $S_{X,Y}$ from those defined on the original sample space $S$. For example, since the events TH and $(0,1)$ are equivalent as seen in Figure 7.2, we have that


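$$\begin{aligned}
p_{X,Y}[0,1] &= P[X(s) = 0, Y(s) = 1] \\
&= P[\{s_i : X(s_i) = 0, Y(s_i) = 1\}] && \text{(equivalent event in } S\text{)} \\
&= P[s_i = \text{TH}] && \text{(mapping is one-to-one)} \\
&= (1-p)p && \text{(independence)}
\end{aligned}$$
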

where  we  have  assumed  independence  of  the  penny  and  nickel  toss  subexperiments 
as  described  in  Section  4.6.1. 
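
The claim that the weighted-coin assignment satisfies Property 7.2 can be checked directly. The short derivation below is a worked verification using only the four PMF values given above for a probability of heads $p$:

```latex
\sum_{i=0}^{1}\sum_{j=0}^{1} p_{X,Y}[i,j]
  = (1-p)^2 + (1-p)p + p(1-p) + p^2
  = \bigl[(1-p) + p\bigr]^2
  = 1
```

Each of the four values is a product of two numbers between 0 and 1, so Property 7.1 holds as well.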

In  general,  the  procedure  to  determine  the  joint  PMF  from  the  probabilities 
defined  on  S  depends  on  whether  the  random  variable  mapping  is  one-to-one  or 
many-to-one.  For  a  one-to-one  mapping  from  S  to  Sx,y  we  have 


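$$\begin{aligned}
p_{X,Y}[x_i, y_j] &= P[X(s) = x_i, Y(s) = y_j] \\
&= P[\{s : X(s) = x_i, Y(s) = y_j\}] \\
&= P[\{s_k\}]
\end{aligned}$$

where it is assumed that $s_k$ is the only solution to $X(s) = x_i$ and $Y(s) = y_j$. For a many-to-one transformation the joint PMF is found as

$$p_{X,Y}[x_i, y_j] = \sum_{\{k:\,X(s_k) = x_i,\,Y(s_k) = y_j\}} P[\{s_k\}]. \qquad (7.2)$$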


This  is  the  extension  of  (5.1)  and  (5.2)  to  a  two-dimensional  random  vector.  An 
example  follows. 

Example  7.1  —  Two  dice  toss  with  different  colored  dice 

A  red  die  and  a  blue  die  are  tossed.  The  die  that  yields  the  larger  number  of  dots 
is  chosen.  If  they  both  display  the  same  number  of  dots,  the  red  die  is  chosen.  The 
numerical  outcome  of  the  experiment  is  defined  to  be  0  if  the  blue  die  is  chosen  and 
1 if the red die is chosen, along with its corresponding number of dots. The random vector is therefore defined as

$$X = \begin{cases} 0 & \text{blue die chosen} \\ 1 & \text{red die chosen} \end{cases}$$

$$Y = \text{number of dots on chosen die.}$$

The outcomes of the experiment can be represented by $(i, j)$ where $i = 0$ for blue, $i = 1$ for red, and $j$ is the number of dots observed. What then is $p_{X,Y}[1,3]$, for example? To determine this we first list all outcomes in Table 7.1 for each number of dots observed on the red and blue dice. It is seen that the mapping is many-to-one. For example, if the red die displays 6 dots, then the outcome is the same, which is $(1,6)$, for all possible blue outcomes.

         blue=1  blue=2  blue=3  blue=4  blue=5  blue=6
red=1    (1,1)   (0,2)   (0,3)   (0,4)   (0,5)   (0,6)
red=2    (1,2)   (1,2)   (0,3)   (0,4)   (0,5)   (0,6)
red=3    (1,3)   (1,3)   (1,3)   (0,4)   (0,5)   (0,6)
red=4    (1,4)   (1,4)   (1,4)   (1,4)   (0,5)   (0,6)
red=5    (1,5)   (1,5)   (1,5)   (1,5)   (1,5)   (0,6)
red=6    (1,6)   (1,6)   (1,6)   (1,6)   (1,6)   (1,6)

Table 7.1: Mapping of outcomes in $S$ to outcomes in $S_{X,Y}$. The outcomes of $(X, Y)$ are $(i, j)$, where $i$ indicates the color of the chosen die (red=1, blue=0) and $j$ indicates the number of dots on that die.

To determine the desired value of the PMF,

we assume that each outcome in $S$ is equally likely and therefore has probability 1/36. Then, from (7.2)

$$\begin{aligned}
p_{X,Y}[1,3] &= \sum_{\{k:\,X(s_k) = 1,\,Y(s_k) = 3\}} P[\{s_k\}] \\
&= \sum_{\{k:\,X(s_k) = 1,\,Y(s_k) = 3\}} \frac{1}{36} \\
&= \frac{3}{36} = \frac{1}{12}
\end{aligned}$$

since there are three outcomes of the experiment in $S$ that map into $(1,3)$. They are (red=3,blue=1), (red=3,blue=2), and (red=3,blue=3).
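
The use of (7.2) in this example can also be carried out by brute-force enumeration. The following MATLAB sketch loops over the 36 equally likely (red, blue) outcomes and accumulates the joint PMF; the array layout (row index $i+1$, column index $j$) is a convention adopted here, not one from the text.

```matlab
% Enumerate the 36 equally likely (red, blue) outcomes of Example 7.1
% and accumulate the joint PMF, pXY(i+1, j) = p_{X,Y}[i, j], using (7.2).
pXY = zeros(2,6);        % row 1: X = 0 (blue chosen), row 2: X = 1 (red chosen)
for red = 1:6
    for blue = 1:6
        if red >= blue   % larger number of dots wins; ties go to the red die
            X = 1; Y = red;
        else
            X = 0; Y = blue;
        end
        pXY(X+1, Y) = pXY(X+1, Y) + 1/36;   % each outcome has probability 1/36
    end
end
p13 = pXY(2,3);          % p_{X,Y}[1,3]
```

The value `p13` should match the 1/12 obtained above, and the entries of `pXY` should sum to one as required by Property 7.2.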

<> 

In general, as in the case of a single random variable we can use the joint PMF to compute probabilities of all events defined on $S_{X,Y} = S_X \times S_Y$. For the event $A \subset S_{X,Y}$, the probability is

$$P[(X, Y) \in A] = \sum_{\{(i,j):\,(x_i, y_j) \in A\}} p_{X,Y}[x_i, y_j]. \qquad (7.3)$$

Once  we  have  knowledge  of  the  joint  PMF,  we  no  longer  need  to  retain  the  underlying 
sample  space  S  of  the  experiment.  All  our  probability  calculations  can  be  made 
concerning  values  of  (X,  Y)  by  using  (7.3). 

7.4  Marginal  PMFs  and  CDFs 

If the joint PMF is known, then the PMF for $X$, i.e., $p_X[x_i]$, and the PMF for $Y$, i.e., $p_Y[y_j]$, can be determined. These are termed the marginal PMFs. Consider first the determination of $p_X[x_i]$. Since $\{X = x_i\}$ does not specify any particular value for $Y$, the event $\{X = x_i\}$ is equivalent to the joint event $\{X = x_i, Y \in S_Y\}$. To determine the probability of the latter event we assume the general case of a countably infinite sample space. Then, (7.3) becomes

$$P[(X, Y) \in A] = \mathop{\sum_{i=1}^{\infty} \sum_{j=1}^{\infty}}_{\{(i,j):\,(x_i, y_j) \in A\}} p_{X,Y}[x_i, y_j]. \qquad (7.4)$$

Figure 7.4: Determination of marginal PMF value $p_X[x_3]$ from joint PMF $p_{X,Y}[x_i, y_j]$ by summing along the $y$ direction over the set $A = \{x_3\} \times S_Y$.

Next let $A = \{x_k\} \times S_Y$, which is illustrated in Figure 7.4 for $k = 3$. Then, we have

$$\begin{aligned}
P[(X, Y) \in \{x_k\} \times S_Y] &= P[X = x_k, Y \in S_Y] \\
&= P[X = x_k] \\
&= p_X[x_k]
\end{aligned}$$

so that from (7.4) with $i = k$ only

$$p_X[x_k] = \sum_{j=1}^{\infty} p_{X,Y}[x_k, y_j] \qquad (7.5)$$

and  is  obtained  for  k  =  3  by  summing  the  probabilities  along  the  column  shown 
in  Figure  7.4.  The  terminology  “marginal”  PMF  originates  from  the  process  of 
summing  the  probabilities  along  each  column  and  writing  the  results  in  the  margin 
(below the $x$ axis), much the same as the process for computing the marginal probability discussed in Section 4.3. Likewise, by summing along each row or in the $x$ direction we obtain the marginal PMF for $Y$ as

$$p_Y[y_k] = \sum_{i=1}^{\infty} p_{X,Y}[x_i, y_k]. \qquad (7.6)$$

In  summary,  we  see  that  from  the  joint  PMF  we  can  obtain  the  marginal  PMFs. 
Another example follows.

Example 7.2 - Two coin toss

As before we toss a penny and a nickel and map the outcomes into a 1 for a head and a 0 for a tail. The random vector is $(X, Y)$, where $X$ is the random variable representing the penny outcome and $Y$ is the random variable representing the nickel outcome. The mapping is shown in Figure 7.2. Consider the joint PMF

$$p_{X,Y}[i,j] = \begin{cases} \frac{1}{8} & i = 0, j = 0 \\ \frac{1}{8} & i = 0, j = 1 \\ \frac{1}{4} & i = 1, j = 0 \\ \frac{1}{2} & i = 1, j = 1. \end{cases}$$

Then, the marginal PMFs are given as

$$p_X[i] = \sum_{j=0}^{1} p_{X,Y}[i,j] = \begin{cases} \frac{1}{8} + \frac{1}{8} = \frac{1}{4} & i = 0 \\ \frac{1}{4} + \frac{1}{2} = \frac{3}{4} & i = 1 \end{cases}$$

$$p_Y[j] = \sum_{i=0}^{1} p_{X,Y}[i,j] = \begin{cases} \frac{1}{8} + \frac{1}{4} = \frac{3}{8} & j = 0 \\ \frac{1}{8} + \frac{1}{2} = \frac{5}{8} & j = 1. \end{cases}$$

As expected, $\sum_{i=0}^{1} p_X[i] = 1$ and $\sum_{j=0}^{1} p_Y[j] = 1$. We could also have arranged the joint PMF and marginal PMF values in a table as shown in Table 7.2. Note that the marginal PMFs are found by summing across a row (for $p_X$) or a column (for $p_Y$) and are written in the "margins".

           j = 0   j = 1   p_X[i]
i = 0       1/8     1/8     1/4
i = 1       1/4     1/2     3/4
p_Y[j]      3/8     5/8

Table 7.2: Joint PMF and marginal PMF values for Examples 7.2 and 7.4.

◊
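
The row and column sums of (7.5) and (7.6) map directly onto matrix operations. As a minimal sketch, assuming the joint PMF of Example 7.2 is stored with rows indexed by $i$ and columns by $j$ (a storage convention chosen here):

```matlab
% Joint PMF of Example 7.2: rows index i (values of X), columns index j (values of Y).
pXY = [1/8 1/8;     % i = 0: p[0,0], p[0,1]
       1/4 1/2];    % i = 1: p[1,0], p[1,1]
pX = sum(pXY, 2);   % (7.5): sum over j, i.e., across each row   (2 x 1 vector)
pY = sum(pXY, 1);   % (7.6): sum over i, i.e., down each column  (1 x 2 vector)
```

Here `pX` holds $p_X[0], p_X[1]$ and `pY` holds $p_Y[0], p_Y[1]$, which can be compared with the margins of Table 7.2.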


Joint  PMF  cannot  be  determined  from  marginal  PMFs. 


Having  obtained  the  marginal  PMFs  from  the  joint  PMF,  we  might  suppose  we 
could  reverse  the  process  to  find  the  joint  PMF  from  the  marginal  PMFs.  However, 
this  is  not  possible  in  general.  To  see  why,  consider  the  joint  PMF  summarized  in 
Table 7.3. The marginal PMFs are the same as the ones shown in Table 7.2. In fact, there are an infinite number of joint PMFs that have the same marginal PMFs. Hence,

joint PMF ⇒ marginal PMFs

but

marginal PMFs ⇏ joint PMF.

           j = 0   j = 1   p_X[i]
i = 0      1/16    3/16     1/4
i = 1      5/16    7/16     3/4
p_Y[j]      3/8     5/8

Table 7.3: Joint PMF values for "caution" example.

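
One way to see that infinitely many joint PMFs share these marginals is to parameterize them. The following is a sketch; the free parameter $a = p_{X,Y}[0,0]$ is introduced here for illustration and does not appear in the original text.

```latex
p_{X,Y}[0,0] = a, \qquad
p_{X,Y}[0,1] = \tfrac{1}{4} - a, \qquad
p_{X,Y}[1,0] = \tfrac{3}{8} - a, \qquad
p_{X,Y}[1,1] = \tfrac{3}{8} + a, \qquad
0 \le a \le \tfrac{1}{4}
```

Every choice of $a$ in this range gives $p_X[0] = 1/4$, $p_X[1] = 3/4$, $p_Y[0] = 3/8$, and $p_Y[1] = 5/8$. Setting $a = 1/8$ reproduces Table 7.2 and $a = 1/16$ reproduces Table 7.3.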

A joint cumulative distribution function (CDF) can also be defined for a random vector. It is given by

$$F_{X,Y}(x, y) = P[X \le x, Y \le y] \qquad (7.7)$$

and can be found explicitly by summing the joint PMF as

$$F_{X,Y}(x, y) = \sum_{\{(i,j):\,x_i \le x,\,y_j \le y\}} p_{X,Y}[x_i, y_j]. \qquad (7.8)$$

An example is shown in Figure 7.5, along with the joint PMF. The marginal CDFs can be easily found from the joint CDF as

$$F_X(x) = P[X \le x] = P[X \le x, Y < \infty] = F_{X,Y}(x, \infty)$$

$$F_Y(y) = P[Y \le y] = P[X < \infty, Y \le y] = F_{X,Y}(\infty, y).$$
The  joint  CDF  has  the  usual  properties  which  are: 

Property 7.3 - Range of values

$$0 \le F_{X,Y}(x, y) \le 1$$

□

Property 7.4 - Values at "endpoints"

$$F_{X,Y}(-\infty, -\infty) = 0$$
$$F_{X,Y}(\infty, \infty) = 1$$

□

Property 7.5 - Monotonically increasing

$F_{X,Y}(x, y)$ monotonically increases as $x$ and/or $y$ increases.

□

Property 7.6 - "Right" continuous

As expected, the joint CDF takes the value after the jump. However, in this case the jump is a line discontinuity as seen, for example, in Figure 7.5b. "After the jump" means as we move in the northeast direction in the x-y plane.

□

Figure 7.5: Joint PMF and corresponding joint CDF; panel (b) shows the joint CDF.


The  reader  is  asked  to  verify  some  of  these  properties  in  Problem  7.17.  Finally,  to 
recover  the  PMF  we  can  use 

$$p_{X,Y}[x_i, y_j] = F_{X,Y}(x_i^+, y_j^+) - F_{X,Y}(x_i^+, y_j^-) - F_{X,Y}(x_i^-, y_j^+) + F_{X,Y}(x_i^-, y_j^-). \qquad (7.9)$$

The reader should verify this formula for the joint CDF shown in Figure 7.5b. In particular, consider the joint PMF at the point $(x_i, y_j) = (2, 2)$ to see why we need four terms.
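
The relationship between (7.8) and (7.9) can be illustrated numerically. The sketch below evaluates the joint CDF at the sample points by cumulative summation and then recovers the PMF with the four-term difference. It reuses the PMF values of Example 7.2 purely as test data; the zero padding stands in for the CDF values "before" the first jump, and the variable names are choices made here.

```matlab
% Joint PMF at the sample points (rows index x_i, columns index y_j).
pXY = [1/8 1/8; 1/4 1/2];            % values of Example 7.2, reused for illustration
F = cumsum(cumsum(pXY, 1), 2);       % F(i,j) = sum of pXY(1:i,1:j): (7.8) at the sample points
Fpad = zeros(size(F) + 1);           % extra zero row/column: CDF is zero below the first jumps
Fpad(2:end, 2:end) = F;
% Four-term difference of (7.9): F(x+,y+) - F(x+,y-) - F(x-,y+) + F(x-,y-)
pRec = Fpad(2:end, 2:end) - Fpad(2:end, 1:end-1) ...
     - Fpad(1:end-1, 2:end) + Fpad(1:end-1, 1:end-1);
```

The recovered array `pRec` should equal `pXY`. Dropping the last term would subtract the lower-left region twice, which is why four terms are needed.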


7.5  Independence  of  Multiple  Random  Variables 

Consider the experiment of tossing a coin and then a die. The outcome of the coin toss is denoted by $X$ and equals 0 for a tail and 1 for a head. The outcome for the die is denoted by $Y$, which takes on the usual values 1, 2, 3, 4, 5, 6. In determining the probability of the random vector $(X, Y)$ taking on a value, there is no reason to believe that the probability of $Y = y_j$ should depend on the outcome of the coin toss. Likewise, the probability of $X = x_i$ should not depend on the outcome of the die toss (especially since the die toss occurs at a later time). We expect that these two events are independent. The formal definition of independent random variables $X$ and $Y$ is that they are independent if all the joint events on $S_{X,Y}$ are independent. Mathematically $X$ and $Y$ are independent random variables if for all events $A \subset S_X$ and $B \subset S_Y$

$$P[X \in A, Y \in B] = P[X \in A]\,P[Y \in B]. \qquad (7.10)$$

The probabilities on the right-hand side of (7.10) are defined on $S_X$ and $S_Y$, respectively (see Figure 7.3 for an example of the relationship of $S_X$, $S_Y$ to $S_{X,Y}$). The utility of the independence property is that the probabilities of joint events may be reduced to probabilities of "marginal events" (defined on $S_X$ and $S_Y$), which are always easier to determine. Specifically, if $X$ and $Y$ are independent random variables, then it follows from (7.10) that

$$p_{X,Y}[x_i, y_j] = p_X[x_i]\,p_Y[y_j] \qquad (7.11)$$

as we now show. If $A = \{x_i\}$ and $B = \{y_j\}$, then the left-hand side of (7.10) becomes

$$P[X \in A, Y \in B] = P[X = x_i, Y = y_j] = p_{X,Y}[x_i, y_j]$$

and the right-hand side of (7.10) becomes

$$P[X \in A]\,P[Y \in B] = p_X[x_i]\,p_Y[y_j].$$

Hence, if $X$ and $Y$ are independent random variables, the joint PMF factors into the product of the marginal PMFs. Furthermore, the converse is true: if the joint PMF factors, then $X$ and $Y$ are independent. To prove the converse assume that the joint PMF factors according to (7.11). Then for all $A$ and $B$ we have

$$\begin{aligned}
P[X \in A, Y \in B] &= \sum_{\{i:\,x_i \in A\}} \sum_{\{j:\,y_j \in B\}} p_{X,Y}[x_i, y_j] && \text{(from (7.3))} \\
&= \sum_{\{i:\,x_i \in A\}} \sum_{\{j:\,y_j \in B\}} p_X[x_i]\,p_Y[y_j] && \text{(assumption)} \\
&= \sum_{\{i:\,x_i \in A\}} p_X[x_i] \sum_{\{j:\,y_j \in B\}} p_Y[y_j] \\
&= P[X \in A]\,P[Y \in B].
\end{aligned}$$

We now illustrate the concept of independent random variables with some examples.

Example 7.3 - Two coin toss - independence

Assume that we toss a penny and a nickel and that as usual a tail is mapped into a 0 and a head into a 1. If all outcomes are equally likely or equivalently the joint PMF is given in Table 7.4, then the random variables must be independent. This is because we can factor the joint PMF as

$$p_{X,Y}[i,j] = p_X[i]\,p_Y[j]$$

for all $i$ and $j$ for which $p_{X,Y}[i,j]$ is nonzero. Furthermore, the marginal PMFs indicate that each coin is fair since $p_X[0] = p_X[1] = 1/2$ and $p_Y[0] = p_Y[1] = 1/2$.

           j = 0   j = 1   p_X[i]
i = 0       1/4     1/4     1/2
i = 1       1/4     1/4     1/2
p_Y[j]      1/2     1/2

Table 7.4: Joint PMF and marginal PMF values for Example 7.3.

◊


Example 7.4 - Two coin toss - dependence

Now consider the same experiment but with a joint PMF given in Table 7.2. We see that $p_{X,Y}[0,0] = 1/8 \neq (1/4)(3/8) = p_X[0]\,p_Y[0]$ and hence X and Y cannot be independent. If two random variables are not independent, they are said to be dependent.

❖


Example 7.5 - Two coin toss - dependent but fair coins

Consider the same experiment again but with the joint PMF given in Table 7.5. Since $p_{X,Y}[0,0] = 3/8 \neq (1/2)(1/2) = p_X[0]\,p_Y[0]$, X and Y are dependent. However,

          j = 0    j = 1    p_X[i]
i = 0     3/8      1/8      1/2
i = 1     1/8      3/8      1/2
p_Y[j]    1/2      1/2

Table 7.5: Joint PMF and marginal PMF values for Example 7.5.




by  examining  the  marginal  PMFs  we  see  that  the  coins  are  in  some  sense  fair  since 
P  [heads]  =  1/2,  and  therefore  we  might  conclude  that  the  random  variables  were 
independent.  This  is  incorrect  and  underscores  the  fact  that  the  marginal  PMFs 
do  not  tell  us  much  about  the  joint  PMF.  The  joint  PMF  of  Table  7.4  also  has  the 
same  marginal  PMFs  but  there  X  and  Y  were  independent. 

❖ 
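The factorization test used in Examples 7.3 to 7.5 can be sketched in a few lines of Python. This is only an illustrative check, not part of the original text: the helper names `marginals` and `is_independent` are hypothetical, and the joint PMFs are those of Tables 7.4 and 7.5 stored as dictionaries keyed by (i, j).

```python
from fractions import Fraction as F
from itertools import product

def marginals(joint):
    """Return (p_X, p_Y) by summing the joint PMF over the other variable."""
    px, py = {}, {}
    for (i, j), p in joint.items():
        px[i] = px.get(i, 0) + p
        py[j] = py.get(j, 0) + p
    return px, py

def is_independent(joint):
    """True if p_{X,Y}[i,j] == p_X[i] * p_Y[j] for every (i, j) in S_X x S_Y."""
    px, py = marginals(joint)
    return all(joint.get((i, j), 0) == px[i] * py[j] for i, j in product(px, py))

# Joint PMFs of Table 7.4 (independent) and Table 7.5 (dependent)
table_7_4 = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 4), (1, 1): F(1, 4)}
table_7_5 = {(0, 0): F(3, 8), (0, 1): F(1, 8), (1, 0): F(1, 8), (1, 1): F(3, 8)}

print(is_independent(table_7_4), is_independent(table_7_5))
print(marginals(table_7_4) == marginals(table_7_5))
```

Exact fractions are used so that the equality test is not affected by floating-point rounding. Both tables have the same marginals, yet only Table 7.4 passes the factorization test.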

Finally, note that if the random variables are independent, the joint CDF factors as well. This is left as an exercise for the student (see Problem 7.20). Intuitively, if X and Y are independent random variables, then knowledge of the outcome of X does not change the probabilities of the outcomes of Y. This means that we cannot predict Y based on knowing that $X = x_i$. Our best predictor of Y is just E[Y], as described in Example 6.3. When X and Y are dependent, however, we can improve upon the predictor E[Y] by using the knowledge that $X = x_i$. How we actually do this is described in Section 7.9.


7.6  Transformations  of  Multiple  Random  Variables 


In Section 5.7 we saw how to find the PMF of Y = g(X) if the PMF of X is given. It is determined using

$$p_Y[y_i] = \sum_{\{j:\,g(x_j) = y_i\}} p_X[x_j].$$

We need only sum the probabilities of the $x_j$'s that map into $y_i$. In the case of two discrete random variables X and Y that are transformed into W = g(X, Y) and Z = h(X, Y), we have the similar result

$$p_{W,Z}[w_i, z_j] = \mathop{\sum\sum}_{\{(k,l):\;g(x_k,y_l)=w_i,\;h(x_k,y_l)=z_j\}} p_{X,Y}[x_k, y_l] \qquad i = 1,2,\ldots,N_W;\; j = 1,2,\ldots,N_Z \qquad (7.12)$$

where $N_W$ and/or $N_Z$ may be infinite. An example follows.


Example 7.6 - Independent Poisson random variables

Assume that the joint PMF is given as the product of the marginal PMFs, where each marginal PMF is a Poisson PMF. Then,

$$p_{X,Y}[k,l] = \exp[-(\lambda_X + \lambda_Y)]\,\frac{\lambda_X^k \lambda_Y^l}{k!\,l!} \qquad k = 0,1,\ldots;\; l = 0,1,\ldots \qquad (7.13)$$

Note that $X \sim \mathrm{Pois}(\lambda_X)$, $Y \sim \mathrm{Pois}(\lambda_Y)$, and X and Y are independent random variables. Consider the transformation

$$\begin{aligned} W &= g(X,Y) = X \\ Z &= h(X,Y) = X + Y. \end{aligned} \qquad (7.14)$$

The possible values of W are those of X, which are 0, 1, ..., and the possible values of Z are also 0, 1, .... According to (7.12), we need to determine all (k, l) so that

$$\begin{aligned} g(x_k, y_l) &= w_i \\ h(x_k, y_l) &= z_j. \end{aligned} \qquad (7.15)$$

But $x_k$ and $y_l$ can be replaced by k and l, respectively, for k = 0, 1, ... and l = 0, 1, .... Also, $w_i$ and $z_j$ can be replaced by i and j, respectively, for i = 0, 1, ... and j = 0, 1, .... The transformation equations become

$$\begin{aligned} g(k,l) &= i \\ h(k,l) &= j \end{aligned}$$

which from (7.14) become

$$\begin{aligned} i &= k \\ j &= k + l. \end{aligned}$$

Solving for (k, l) for the given (i, j) desired, we have that k = i and $l = j - i \geq 0$, which is the only solution. Note that from (7.13) the joint PMF for X and Y is nonzero only if l = 0, 1, .... Therefore, we must have $l \geq 0$ so that $l = j - i \geq 0$.
From  (7.12)  we  now  have 


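$$p_{W,Z}[i,j] = \mathop{\sum_{k=0}^{\infty}\sum_{l=0}^{\infty}}_{\{(k,l):\;k=i,\;l=j-i\geq 0\}} p_{X,Y}[k,l] = p_{X,Y}[i, j-i]\,u[i]\,u[j-i] \qquad (7.16)$$

where u[n] is the discrete unit step sequence defined as

$$u[n] = \begin{cases} 0 & n = \ldots, -2, -1 \\ 1 & n = 0, 1, \ldots. \end{cases}$$

Finally, we have upon using (7.13)

$$p_{W,Z}[i,j] = \exp[-(\lambda_X + \lambda_Y)]\,\frac{\lambda_X^i \lambda_Y^{j-i}}{i!\,(j-i)!}\,u[i]\,u[j-i] \qquad (7.17)$$

$$\phantom{p_{W,Z}[i,j]} = \exp[-(\lambda_X + \lambda_Y)]\,\frac{\lambda_X^i \lambda_Y^{j-i}}{i!\,(j-i)!} \qquad i = 0,1,\ldots;\; j = i, i+1, \ldots \qquad (7.18)$$

⚠ Use the discrete unit step sequence to avoid mistakes.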


As we have seen in the preceding example, the discrete unit step sequence was introduced to designate the region of the w-z plane over which $p_{W,Z}[i,j]$ is nonzero. A common mistake in problems of this type is to disregard this region and assert that the joint PMF given by (7.18) is nonzero over i = 0, 1, ...; j = 0, 1, .... Note, however, that the transformation will generally change the region over which the new joint PMF is nonzero. It is as important to determine this region as it is to find the analytical form of $p_{W,Z}$. To avoid possible errors it is advisable to replace (7.13) at the outset by

$$p_{X,Y}[k,l] = \exp[-(\lambda_X + \lambda_Y)]\,\frac{\lambda_X^k \lambda_Y^l}{k!\,l!}\,u[k]\,u[l].$$

Then, the use of the unit step functions will serve to keep track of the nonzero PMF regions before and after the transformation. See also Problem 7.25 for another example.

⚠
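As a rough numerical companion to Example 7.6, the sketch below applies (7.12) to a truncated version of the Poisson joint PMF (7.13) and confirms that the transformed PMF is supported only on $j \geq i$. The truncation length `K`, the parameter values, and the helper names `poisson_pmf` and `transform_joint_pmf` are assumptions made for illustration only.

```python
import math

def poisson_pmf(k, lam):
    """Poisson PMF exp(-lam) * lam^k / k! for k = 0, 1, ..."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def transform_joint_pmf(joint, g, h):
    """Apply (7.12): add p_{X,Y}[k,l] into the cell (g(k,l), h(k,l))."""
    out = {}
    for (k, l), p in joint.items():
        key = (g(k, l), h(k, l))
        out[key] = out.get(key, 0.0) + p
    return out

lam_x, lam_y, K = 1.5, 2.0, 30  # K truncates the infinite support (assumption)
joint_xy = {(k, l): poisson_pmf(k, lam_x) * poisson_pmf(l, lam_y)
            for k in range(K) for l in range(K)}

# W = X, Z = X + Y as in (7.14)
joint_wz = transform_joint_pmf(joint_xy, lambda k, l: k, lambda k, l: k + l)

# Every (i, j) with nonzero probability satisfies j >= i, as u[i]u[j-i] indicates
print(all(j >= i for (i, j) in joint_wz))
```

Because the mapping from (k, l) to (i, j) is one-to-one here, each cell of `joint_wz` receives exactly one term, which matches the single solution k = i, l = j - i found in the example.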

We sometimes wish to determine the PMF of Z = h(X, Y) only, which is a transformation from (X, Y) to Z. In this case, we can use an auxiliary random variable. That is to say, we add another random variable W so that the transformation becomes a transformation from (X, Y) to (W, Z) as before. We can then determine $p_{W,Z}[w_i, z_j]$ by once again using (7.12), and then $p_Z$, which is the marginal PMF, can be found as

$$p_Z[z_j] = \sum_{\{i:\,w_i \in S_W\}} p_{W,Z}[w_i, z_j]. \qquad (7.19)$$

As we have seen in the previous example, we will first need to solve (7.15) for $x_k$ and $y_l$. To facilitate the solution we usually define a simple auxiliary random variable such as W = X.

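Example 7.7 - PMF for sum of independent Poisson random variables (continuation of previous example)

To find the PMF of Z = X + Y from the joint PMF given by (7.13), we use (7.19) with W = X. We then have $S_W = S_X = \{0, 1, \ldots\}$ and

$$\begin{aligned}
p_Z[j] &= \sum_{i=0}^{\infty} p_{W,Z}[i,j] && \text{(from (7.19))} \qquad (7.20) \\
&= \sum_{i=0}^{\infty} \exp[-(\lambda_X + \lambda_Y)]\,\frac{\lambda_X^i \lambda_Y^{j-i}}{i!\,(j-i)!}\,u[i]\,u[j-i] && \text{(from (7.17))}
\end{aligned}$$

and since u[i] = 1 for i = 0, 1, ... and u[j - i] = 1 for i = 0, 1, ..., j and u[j - i] = 0 for i > j, this reduces to

$$p_Z[j] = \sum_{i=0}^{j} \exp[-(\lambda_X + \lambda_Y)]\,\frac{\lambda_X^i \lambda_Y^{j-i}}{i!\,(j-i)!} \qquad j = 0, 1, \ldots.$$

Note that Z can take on values j = 0, 1, ... since Z = X + Y and both X and Y take on values in {0, 1, ...}. To evaluate this sum we can use the binomial theorem as follows:

$$\begin{aligned}
p_Z[j] &= \exp[-(\lambda_X + \lambda_Y)]\,\frac{1}{j!} \sum_{i=0}^{j} \frac{j!}{i!\,(j-i)!}\,\lambda_X^i \lambda_Y^{j-i} \\
&= \exp[-(\lambda_X + \lambda_Y)]\,\frac{1}{j!} \sum_{i=0}^{j} \binom{j}{i} \lambda_X^i \lambda_Y^{j-i} \\
&= \exp[-(\lambda_X + \lambda_Y)]\,\frac{1}{j!}\,(\lambda_X + \lambda_Y)^j && \text{(use binomial theorem)} \\
&= \exp(-\lambda)\,\frac{\lambda^j}{j!} && \text{(let } \lambda = \lambda_X + \lambda_Y)
\end{aligned}$$

for j = 0, 1, .... This is recognized as a Poisson PMF with $\lambda = \lambda_X + \lambda_Y$. By this example then, we have shown that if $X \sim \mathrm{Pois}(\lambda_X)$, $Y \sim \mathrm{Pois}(\lambda_Y)$, and X and Y are independent, then $X + Y \sim \mathrm{Pois}(\lambda_X + \lambda_Y)$. This is called the reproducing PMF property. It is also extendible to any number of independent Poisson random variables that are added together.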

❖ 
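The reproducing property of Example 7.7 is easy to check numerically. The following sketch, with arbitrarily chosen parameter values and a hypothetical helper `pmf_of_sum`, evaluates the finite sum over i = 0, ..., j for nonnegative integer-valued independent X and Y and compares it with the Pois($\lambda_X + \lambda_Y$) PMF.

```python
import math

def poisson_pmf(k, lam):
    """Poisson PMF exp(-lam) * lam^k / k! for k = 0, 1, ..."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def pmf_of_sum(px, py, j):
    """p_Z[j] = sum_{i=0}^{j} p_X[i] p_Y[j-i] for independent X, Y >= 0."""
    return sum(px(i) * py(j - i) for i in range(j + 1))

lam_x, lam_y = 1.5, 2.0  # illustrative values, not from the text

# Largest deviation from the Pois(lam_x + lam_y) PMF over j = 0, ..., 19
max_err = max(
    abs(pmf_of_sum(lambda i: poisson_pmf(i, lam_x),
                   lambda i: poisson_pmf(i, lam_y), j)
        - poisson_pmf(j, lam_x + lam_y))
    for j in range(20)
)
print(max_err)  # on the order of floating-point rounding error
```

The upper limit j on the sum plays the role of the unit step u[j - i] in the derivation above.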

The formula given by (7.20), when we let $p_{W,Z}[i,j] = p_{X,Y}[i, j-i]$ from (7.16), is valid for the PMF of the sum of any two discrete random variables, whether they are independent or not. Summarizing, if X and Y are random variables that take on integer values from $-\infty$ to $+\infty$, then Z = X + Y has the PMF

$$p_Z[j] = \sum_{i=-\infty}^{\infty} p_{X,Y}[i, j-i]. \qquad (7.21)$$

This result says that we should sum all the values of the joint PMF such that the x value, which is i, and the y value, which is j - i, sum to the z value of j. In particular, if the random variables are independent, then since the joint PMF must factor, we have the result

$$p_Z[j] = \sum_{i=-\infty}^{\infty} p_X[i]\,p_Y[j-i]. \qquad (7.22)$$

But this summation operation is a discrete convolution [Jackson 1991]. It is usually written succinctly as $p_Z = p_X \star p_Y$, where $\star$ denotes the convolution operator. This result suggests that the use of Fourier transforms would be a useful tool since a convolution can be converted into a simple multiplication in the Fourier domain. We have already seen in Chapter 6 that the Fourier transform (defined with a +j) of a PMF $p_X[k]$ is the characteristic function $\phi_X(\omega) = E[\exp(j\omega X)]$. Therefore, taking the Fourier transform of both sides of (7.22) produces

$$\phi_Z(\omega) = \phi_X(\omega)\,\phi_Y(\omega) \qquad (7.23)$$

and by converting back to the original sequence domain, the PMF becomes

$$p_Z[j] = \mathcal{F}^{-1}\{\phi_X(\omega)\,\phi_Y(\omega)\} \qquad (7.24)$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform. An example follows.

Example 7.8 - PMF for sum of independent Poisson random variables using characteristic function approach

In Section 6.7 we showed that if $X \sim \mathrm{Pois}(\lambda)$, then

$$\phi_X(\omega) = \exp[\lambda(\exp(j\omega) - 1)]$$

and thus using (7.23) and (7.24)

$$\begin{aligned}
p_Z[j] &= \mathcal{F}^{-1}\{\exp[\lambda_X(\exp(j\omega) - 1)]\,\exp[\lambda_Y(\exp(j\omega) - 1)]\} \\
&= \mathcal{F}^{-1}\{\exp[(\lambda_X + \lambda_Y)(\exp(j\omega) - 1)]\}.
\end{aligned}$$

But the characteristic function in the braces is that of a Poisson random variable. Using Property 6.5 we see that $Z \sim \mathrm{Pois}(\lambda_X + \lambda_Y)$. The use of characteristic functions for the determination of the PMF for a sum of independent random variables has considerably simplified the derivation.

❖
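The inverse Fourier transform in (7.24) can also be evaluated numerically. The sketch below is a minimal illustration, assuming the inverse transform is the integral of $\phi_X(\omega)\phi_Y(\omega)\exp(-j\omega k)$ over one period $[-\pi, \pi]$ divided by $2\pi$, and approximating that integral with an n-point Riemann sum. The function names, the value of `n`, and the parameter values are assumptions; note that Python writes the imaginary unit as `1j`.

```python
import cmath
import math

def poisson_cf(omega, lam):
    """Characteristic function exp[lam (exp(j omega) - 1)] of Pois(lam)."""
    return cmath.exp(lam * (cmath.exp(1j * omega) - 1))

def pmf_from_cf(cf, k, n=256):
    """Approximate (1/2pi) * integral over [-pi, pi] of cf(w) exp(-j w k) dw."""
    total = 0j
    for m in range(n):
        omega = -math.pi + 2 * math.pi * m / n
        total += cf(omega) * cmath.exp(-1j * omega * k)
    return (total / n).real  # (1/2pi) * sum * (2pi/n)

lam_x, lam_y = 1.5, 2.0  # illustrative values

def cf_z(omega):
    """Product of characteristic functions, as in (7.23)."""
    return poisson_cf(omega, lam_x) * poisson_cf(omega, lam_y)

p3 = pmf_from_cf(cf_z, 3)
exact = math.exp(-(lam_x + lam_y)) * (lam_x + lam_y) ** 3 / math.factorial(3)
print(p3, exact)
```

For an integer-valued random variable the integrand is periodic, so a uniform sum over one period converges quickly; the two printed values should agree closely.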

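In summary, if X and Y are independent random variables with integer values, then the PMF of Z = X + Y is given by

$$\begin{aligned}
p_Z[k] &= \mathcal{F}^{-1}\{\phi_X(\omega)\,\phi_Y(\omega)\} \\
&= \int_{-\pi}^{\pi} \phi_X(\omega)\,\phi_Y(\omega)\,\exp(-j\omega k)\,\frac{d\omega}{2\pi}. \qquad (7.25)
\end{aligned}$$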

When the sample space $S_{X,Y}$ is finite, it is sometimes possible to obtain the PMF of Z = g(X, Y) by a direct calculation, thus avoiding the need to use (7.19). The latter requires one to first find the transformed joint PMF $p_{W,Z}$. To carry out the direct calculation we

1. Determine the finite sample space $S_Z$.

2. Determine which sample points $(x_i, y_j)$ in $S_{X,Y}$ map into each $z_k \in S_Z$.

3. Sum the probabilities of those $(x_i, y_j)$ sample points to yield $p_Z[z_k]$.

Mathematically, this is equivalent to

$$p_Z[z_k] = \mathop{\sum\sum}_{\{(i,j):\;z_k = g(x_i, y_j)\}} p_{X,Y}[x_i, y_j]. \qquad (7.26)$$

An example follows.

Example 7.9 - Direct computation of PMF for transformed random variable, Z = g(X, Y)

Consider the transformation of the random vector (X, Y) into the scalar random variable $Z = X^2 + Y^2$. The joint PMF is given by

$$p_{X,Y}[i,j] = \begin{cases} 3/8 & i = 0,\, j = 0 \\ 1/8 & i = 1,\, j = 0 \\ 1/8 & i = 0,\, j = 1 \\ 3/8 & i = 1,\, j = 1. \end{cases}$$

To find the PMF for Z first note that (X, Y) takes on the values (i, j) = (0, 0), (1, 0), (0, 1), (1, 1). Therefore, Z must take on the values $z_k = i^2 + j^2 = 0, 1, 2$. Then from (7.26)

$$\begin{aligned}
p_Z[0] &= \mathop{\sum\sum}_{\{(i,j):\;0 = i^2 + j^2\}} p_{X,Y}[i,j] \\
&= \sum_{i=0}^{0} \sum_{j=0}^{0} p_{X,Y}[i,j] \\
&= p_{X,Y}[0,0] = \frac{3}{8}
\end{aligned}$$

and similarly

$$\begin{aligned}
p_Z[1] &= p_{X,Y}[0,1] + p_{X,Y}[1,0] = \frac{2}{8} \\
p_Z[2] &= p_{X,Y}[1,1] = \frac{3}{8}.
\end{aligned}$$

❖
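The three-step direct calculation of (7.26) amounts to accumulating probabilities by the value of g. A minimal sketch, with the hypothetical helper `pmf_of_function` and the joint PMF of Example 7.9 stored as a dictionary, is:

```python
from fractions import Fraction as F

def pmf_of_function(joint, g):
    """Apply (7.26): add p_{X,Y}[x,y] into the bin z = g(x, y)."""
    pz = {}
    for (x, y), p in joint.items():
        z = g(x, y)
        pz[z] = pz.get(z, 0) + p
    return pz

# Joint PMF of Example 7.9
joint = {(0, 0): F(3, 8), (1, 0): F(1, 8), (0, 1): F(1, 8), (1, 1): F(3, 8)}

pz = pmf_of_function(joint, lambda x, y: x**2 + y**2)
for z in sorted(pz):
    print(z, pz[z])
```

The keys of `pz` form the sample space $S_Z$, so step 1 of the procedure is obtained as a by-product of steps 2 and 3.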


7.7  Expected  Values 

In addition to determining the PMF of a function of two random variables, we are frequently interested in the average value of that function. Specifically, if Z = g(X, Y), then by definition its expected value is

$$E[Z] = \sum_i z_i\,p_Z[z_i]. \qquad (7.27)$$

To determine E[Z] according to (7.27) we need to first find the PMF of Z and then perform the summation. Alternatively, by a similar derivation to that given in Appendix 6A, we can show that a more direct approach is

$$E[Z] = \sum_i \sum_j g(x_i, y_j)\,p_{X,Y}[x_i, y_j]. \qquad (7.28)$$

To remind us that we are using $p_{X,Y}$ as the averaging PMF, we will modify our previous notation from E[Z] to $E_{X,Y}[Z]$, where of course, Z depends on X and Y. We therefore have the useful result that the expected value of a function of two random variables is

$$E_{X,Y}[g(X,Y)] = \sum_i \sum_j g(x_i, y_j)\,p_{X,Y}[x_i, y_j]. \qquad (7.29)$$

Some examples follow.

Example 7.10 - Expected value of a sum of random variables

If Z = g(X, Y) = X + Y, then

$$\begin{aligned}
E_{X,Y}[X + Y] &= \sum_i \sum_j (x_i + y_j)\,p_{X,Y}[x_i, y_j] \\
&= \sum_i \sum_j x_i\,p_{X,Y}[x_i, y_j] + \sum_i \sum_j y_j\,p_{X,Y}[x_i, y_j] \\
&= \sum_i x_i \underbrace{\sum_j p_{X,Y}[x_i, y_j]}_{p_X[x_i]} + \sum_j y_j \underbrace{\sum_i p_{X,Y}[x_i, y_j]}_{p_Y[y_j]} && \text{(from (7.6))} \\
&= E_X[X] + E_Y[Y] && \text{(definition of expected value).}
\end{aligned}$$

Hence, the expected value of a sum of random variables is the sum of the expected values. Note that we now use the more descriptive notation $E_X[X]$ to replace E[X] used previously.

❖


Similarly,

$$E_{X,Y}[aX + bY] = aE_X[X] + bE_Y[Y]$$

and thus, as we have seen previously for a single random variable, the expectation $E_{X,Y}$ is a linear operation.

Example 7.11 - Expected value of a product of random variables

If g(X, Y) = XY, then

$$E_{X,Y}[XY] = \sum_i \sum_j x_i y_j\,p_{X,Y}[x_i, y_j].$$

We cannot evaluate this further without specifying $p_{X,Y}$. If, however, X and Y are independent, then since the joint PMF factors, we have

$$\begin{aligned}
E_{X,Y}[XY] &= \sum_i \sum_j x_i y_j\,p_X[x_i]\,p_Y[y_j] \\
&= \sum_i x_i\,p_X[x_i] \sum_j y_j\,p_Y[y_j] \\
&= E_X[X]\,E_Y[Y]. \qquad (7.30)
\end{aligned}$$

More generally, we can show by using (7.29) that if X and Y are independent, then (see Problem 7.30)

$$E_{X,Y}[g(X)h(Y)] = E_X[g(X)]\,E_Y[h(Y)]. \qquad (7.31)$$

❖
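Both results can be illustrated numerically with (7.29). In the sketch below the helper `expect` is hypothetical, and the joint PMFs of Tables 7.4 (independent) and 7.5 (dependent) are reused as test cases: linearity holds for both, while the product rule (7.30) holds only for the independent case.

```python
from fractions import Fraction as F

def expect(joint, g):
    """E_{X,Y}[g(X,Y)] from (7.29): sum of g(x, y) * p_{X,Y}[x, y]."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

table_7_4 = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 4), (1, 1): F(1, 4)}
table_7_5 = {(0, 0): F(3, 8), (0, 1): F(1, 8), (1, 0): F(1, 8), (1, 1): F(3, 8)}

for name, joint in (("Table 7.4", table_7_4), ("Table 7.5", table_7_5)):
    ex = expect(joint, lambda x, y: x)
    ey = expect(joint, lambda x, y: y)
    e_sum = expect(joint, lambda x, y: x + y)
    e_prod = expect(joint, lambda x, y: x * y)
    print(name, "sum rule:", e_sum == ex + ey, "product rule:", e_prod == ex * ey)
```

The marginal expectations are obtained from the joint PMF by using a function g that ignores one argument, which is the identity noted later in the text via Problem 7.28.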


Example 7.12 - Variance of a sum of random variables

Consider the calculation of var(X + Y). Then, letting $Z = g(X,Y) = (X + Y - E_{X,Y}[X + Y])^2$, we have

$$\begin{aligned}
\mathrm{var}(X + Y) &= E_Z[Z] && \text{(definition of variance)} \\
&= E_{X,Y}[g(X,Y)] && \text{(from (7.28))} \\
&= E_{X,Y}[(X + Y - E_{X,Y}[X + Y])^2] \\
&= E_{X,Y}[[(X - E_X[X]) + (Y - E_Y[Y])]^2] \\
&= E_{X,Y}[(X - E_X[X])^2 + 2(X - E_X[X])(Y - E_Y[Y]) + (Y - E_Y[Y])^2] \\
&= E_X[(X - E_X[X])^2] + 2E_{X,Y}[(X - E_X[X])(Y - E_Y[Y])] + E_Y[(Y - E_Y[Y])^2] && \text{(linearity of expectation)} \\
&= \mathrm{var}(X) + 2E_{X,Y}[(X - E_X[X])(Y - E_Y[Y])] + \mathrm{var}(Y) && \text{(definition of variance)}
\end{aligned}$$

where we have also used $E_{X,Y}[g(X)] = E_X[g(X)]$ and $E_{X,Y}[h(Y)] = E_Y[h(Y)]$ (see Problem 7.28). The cross-product term is called the covariance and is denoted by cov(X, Y) so that

$$\mathrm{cov}(X,Y) = E_{X,Y}[(X - E_X[X])(Y - E_Y[Y])]. \qquad (7.32)$$

Its interpretation is discussed in the next section. Hence, we finally have that the variance of a sum of random variables is

$$\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y) + 2\,\mathrm{cov}(X,Y). \qquad (7.33)$$

Unlike the expected value or mean, the variance of a sum is not in general the sum of the variances. It will only be so when cov(X, Y) = 0. An alternative expression for the covariance is (see Problem 7.34)

$$\mathrm{cov}(X,Y) = E_{X,Y}[XY] - E_X[X]\,E_Y[Y] \qquad (7.34)$$

which is analogous to Property 6.1 for the variance.

❖
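As a quick check of (7.33), the following sketch computes both sides exactly for the dependent joint PMF of Table 7.5. The helper names `expect`, `cov`, and `var_of` are hypothetical.

```python
from fractions import Fraction as F

def expect(joint, g):
    """E_{X,Y}[g(X,Y)] from (7.29)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def cov(joint):
    """Covariance as the joint central moment (7.32)."""
    ex = expect(joint, lambda x, y: x)
    ey = expect(joint, lambda x, y: y)
    return expect(joint, lambda x, y: (x - ex) * (y - ey))

def var_of(joint, f):
    """Variance of the random variable f(X, Y)."""
    m = expect(joint, f)
    return expect(joint, lambda x, y: (f(x, y) - m) ** 2)

table_7_5 = {(0, 0): F(3, 8), (0, 1): F(1, 8), (1, 0): F(1, 8), (1, 1): F(3, 8)}

var_x = var_of(table_7_5, lambda x, y: x)
var_y = var_of(table_7_5, lambda x, y: y)
var_sum = var_of(table_7_5, lambda x, y: x + y)
cov_xy = cov(table_7_5)

print(var_sum, var_x + var_y + 2 * cov_xy)
```

Here the covariance is positive, so the variance of the sum exceeds the sum of the variances.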

7.8  Joint  Moments 

Joint PMFs describe the probabilistic behavior of two random variables completely. At times it is important to answer questions such as “If the outcome of one random variable is a given value, what can we say about the outcome of the other random variable? Will it be about the same or have the same magnitude or have no relationship to the other random variable?” For example, in Table 4.1, which lists the joint probabilities of college students having various heights and weights, there is clearly some type of relationship between height and weight. It is our intention to quantify this type of relationship in a succinct and meaningful way as opposed to a listing of probabilities of the various height-weight pairs. The concept of the covariance allows us to accomplish this goal. Note from (7.32) that the covariance is a joint central moment. To appreciate the information that it can provide we refer to the three possible joint PMFs depicted in Figure 7.6. The possible values of each joint PMF are shown as solid circles and each possible outcome has a probability of 1/2.

Figure 7.6: Joint PMFs depicting different relationships between the random variables X and Y. (a) $p_{X,Y}[-1,-1] = p_{X,Y}[1,1] = 1/2$. (b) $p_{X,Y}[1,-1] = p_{X,Y}[-1,1] = 1/2$. (c) $p_{X,Y}[1,1] = p_{X,Y}[1,-1] = 1/2$.

In Figure 7.6a if X = 1, then Y = 1, and if X = -1, then Y = -1. The relationship is Y = X. Note, however, that we cannot determine the value of Y until after the experiment is performed and we are told the value of X. If $X = x_1$, then we know that $Y = X = x_1$. Likewise, in Figure 7.6b we have that Y = -X and so if $X = x_1$, then $Y = -x_1$. However, in Figure 7.6c if X = 1, then Y can equal either +1 or -1. On the average if X = 1, we will have that Y = 0 since Y = ±1 with equal probability. To quantify these relationships we form the product XY, which can take on the values +1, -1, and ±1 for the joint PMFs of Figures 7.6a, 7.6b, and 7.6c, respectively. To determine the value of XY on the average we define the joint moment as $E_{X,Y}[XY]$. From (7.29) this is evaluated as

$$E_{X,Y}[XY] = \sum_i \sum_j x_i y_j\,p_{X,Y}[x_i, y_j]. \qquad (7.35)$$

The reader should compare the joint moment with the usual moment for a single random variable $E_X[X] = \sum_i x_i\,p_X[x_i]$. For the joint PMFs of Figure 7.6 the joint moment is

$$\begin{aligned}
E_{X,Y}[XY] &= \sum_{i=1}^{2} \sum_{j=1}^{2} x_i y_j\,p_{X,Y}[x_i, y_j] \\
&= (1)(1)\tfrac{1}{2} + (-1)(-1)\tfrac{1}{2} = 1 && \text{(for PMF of Figure 7.6a)} \\
&= (1)(-1)\tfrac{1}{2} + (-1)(1)\tfrac{1}{2} = -1 && \text{(for PMF of Figure 7.6b)} \\
&= (1)(-1)\tfrac{1}{2} + (1)(1)\tfrac{1}{2} = 0 && \text{(for PMF of Figure 7.6c)}
\end{aligned}$$

as we might have expected.

In Figure 7.6a note that $E_X[X] = E_Y[Y] = 0$. If they are not zero, as for the joint PMF shown in Figure 7.7 in which $E_{X,Y}[XY] = 2$, then the joint moment will depend on the values of the means. It is seen that even though the relationship Y = X is preserved, the joint moment has changed. To nullify this effect of having nonzero means influence the joint moment it is more convenient to use the joint central moment

$$E_{X,Y}[(X - E_X[X])(Y - E_Y[Y])] \qquad (7.36)$$

which will produce the desired +1 for the joint PMF of Figure 7.7. This quantity is recognized as the covariance of X and Y so that we denote it by cov(X, Y). As we have just seen, the covariance may be positive, negative, or zero. Note that the covariance is a measure of how the random variables covary with respect to each other. If they vary in the same direction, i.e., both positive or negative at the same time, then the covariance will be positive. If they vary in opposite directions, the covariance will be negative. This explains why var(X + Y) may be greater than var(X) + var(Y), for the case of a positive covariance. Similarly, the variance of the sum of the random variables will be less than the sum of the variances if the covariance is negative.

Figure 7.7: Joint PMF for nonzero means with equally probable outcomes.

If X and Y are independent random variables, then from (7.31) we have

$$\begin{aligned}
\mathrm{cov}(X,Y) &= E_{X,Y}[(X - E_X[X])(Y - E_Y[Y])] \\
&= E_X[X - E_X[X]]\,E_Y[Y - E_Y[Y]] = 0. \qquad (7.37)
\end{aligned}$$

Hence, independent random variables have a covariance of zero. This also says that for independent random variables the variance of the sum of random variables is the sum of the variances, i.e., var(X + Y) = var(X) + var(Y) (see (7.33)). However, the covariance may still be zero even if the random variables are not independent; that is, the converse is not true. Some other properties of the covariance are given in Problem 7.34.


⚠ Independence implies zero covariance but zero covariance does not imply independence.

Consider the joint PMF which assigns equal probability of 1/4 to each of the four points shown in Figure 7.8. The joint and marginal PMFs are listed in Table 7.6.

Figure 7.8: Joint PMF of random variables having zero covariance but that are dependent.

          j = -1   j = 0    j = 1    p_X[i]
i = -1    0        1/4      0        1/4
i = 0     1/4      0        1/4      1/2
i = 1     0        1/4      0        1/4
p_Y[j]    1/4      1/2      1/4

Table 7.6: Joint PMF values.

For this joint PMF the covariance is zero since

$$E_X[X] = -1\left(\tfrac{1}{4}\right) + 0\left(\tfrac{1}{2}\right) + 1\left(\tfrac{1}{4}\right) = 0$$

and thus from (7.34)

$$\mathrm{cov}(X,Y) = E_{X,Y}[XY] = \sum_{i=-1}^{1} \sum_{j=-1}^{1} i\,j\,p_{X,Y}[i,j] = 0$$

since either x or y is always zero. However, X and Y are dependent because $p_{X,Y}[1,0] = 1/4$ but $p_X[1]\,p_Y[0] = (1/4)(1/2) = 1/8$. Alternatively, we may argue that the random variables must be dependent since Y can be predicted from X. For example, if X = 1, then surely we must have Y = 0.

⚠
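The counterexample of Table 7.6 can be verified mechanically. The sketch below, with the hypothetical helper `expect`, computes the covariance from (7.34) and compares one joint probability with the product of its marginals.

```python
from fractions import Fraction as F

def expect(joint, g):
    """E_{X,Y}[g(X,Y)] from (7.29)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

# Nonzero entries of Table 7.6
table_7_6 = {(-1, 0): F(1, 4), (0, -1): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 4)}

ex = expect(table_7_6, lambda x, y: x)
ey = expect(table_7_6, lambda x, y: y)
cov_xy = expect(table_7_6, lambda x, y: x * y) - ex * ey  # (7.34)

# Marginal probabilities p_X[1] and p_Y[0]
p_x1 = sum(p for (x, y), p in table_7_6.items() if x == 1)
p_y0 = sum(p for (x, y), p in table_7_6.items() if y == 0)

print("cov =", cov_xy)
print("p_XY[1,0] =", table_7_6[(1, 0)], " p_X[1] p_Y[0] =", p_x1 * p_y0)
```

The covariance is zero while the joint PMF fails to factor at (1, 0), so the random variables are uncorrelated yet dependent.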

More generally the joint k-lth moment is defined as

$$E_{X,Y}[X^k Y^l] = \sum_i \sum_j x_i^k y_j^l\,p_{X,Y}[x_i, y_j] \qquad (7.38)$$

for k = 1, 2, ...; l = 1, 2, ..., when it exists. The joint k-lth central moment is defined as

$$E_{X,Y}[(X - E_X[X])^k (Y - E_Y[Y])^l] = \sum_i \sum_j (x_i - E_X[X])^k (y_j - E_Y[Y])^l\,p_{X,Y}[x_i, y_j] \qquad (7.39)$$

for k = 1, 2, ...; l = 1, 2, ..., when it exists.


7.9  Prediction  of  a  Random  Variable  Outcome 

The covariance between two random variables has an important bearing on the predictability of Y based on knowledge of the outcome of X. We have already seen in Figures 7.6a,b that Y can be perfectly predicted from X as Y = X (see Figure 7.6a) or as Y = -X (see Figure 7.6b). These are extreme cases. More generally, we seek a predictor of Y that is linear (actually affine) in X or

$$\hat{Y} = aX + b$$

where the “hat” indicates an estimator. The constants a and b are to be chosen so that “on the average” the observed value of $\hat{Y}$, which is ax + b if the experimental outcome is (x, y), is close to the observed value of Y, which is y. To determine these constants we shall adopt as our measure of closeness the mean square error (MSE) criterion described previously in Example 6.3. It is given by

$$\mathrm{mse}(a, b) = E_{X,Y}[(Y - \hat{Y})^2]. \qquad (7.40)$$

Note that since the predictor $\hat{Y}$ depends on X, we need to average with respect to X and Y. Previously, we let $\hat{Y} = b$, not having the additional information of the outcome of another random variable. It was found in Example 6.3 that the optimal value of b, i.e., the value that minimized the MSE, was $b_{\mathrm{opt}} = E_Y[Y]$ and therefore $\hat{Y} = E_Y[Y]$. Now, however, we presume to know the outcome of X. With the additional knowledge of the outcome of X we should be able to find a better predictor. To find the optimal values of a and b we minimize (7.40) over a and b. Before doing so we simplify the expression for the MSE. Starting with (7.40)


$$\begin{aligned}
\mathrm{mse}(a, b) &= E_{X,Y}[(Y - aX - b)^2] \\
&= E_{X,Y}[(Y - aX)^2 - 2b(Y - aX) + b^2] \\
&= E_{X,Y}[Y^2 - 2aXY + a^2X^2 - 2bY + 2abX + b^2] \\
&= E_Y[Y^2] - 2aE_{X,Y}[XY] + a^2E_X[X^2] - 2bE_Y[Y] + 2abE_X[X] + b^2.
\end{aligned}$$

To find the values of a and b that minimize the function mse(a, b), we determine a stationary point by partial differentiation. Since the function is quadratic in a and b, this will yield the minimizing values of a and b. Using partial differentiation and setting each partial derivative equal to zero produces

$$\begin{aligned}
\frac{\partial\,\mathrm{mse}(a,b)}{\partial a} &= -2E_{X,Y}[XY] + 2aE_X[X^2] + 2bE_X[X] = 0 \\
\frac{\partial\,\mathrm{mse}(a,b)}{\partial b} &= -2E_Y[Y] + 2aE_X[X] + 2b = 0
\end{aligned}$$

and rearranging yields the two simultaneous linear equations

$$\begin{aligned}
E_X[X^2]\,a + E_X[X]\,b &= E_{X,Y}[XY] \\
E_X[X]\,a + b &= E_Y[Y].
\end{aligned}$$

The solution is easily shown to be

$$\begin{aligned}
a_{\mathrm{opt}} &= \frac{E_{X,Y}[XY] - E_X[X]E_Y[Y]}{E_X[X^2] - E_X^2[X]} = \frac{\mathrm{cov}(X,Y)}{\mathrm{var}(X)} \\
b_{\mathrm{opt}} &= E_Y[Y] - a_{\mathrm{opt}}E_X[X] = E_Y[Y] - \frac{\mathrm{cov}(X,Y)}{\mathrm{var}(X)}E_X[X]
\end{aligned}$$

so that the optimal linear prediction of Y given the outcome X = x is

$$\begin{aligned}
\hat{Y} &= a_{\mathrm{opt}}\,x + b_{\mathrm{opt}} \\
&= \frac{\mathrm{cov}(X,Y)}{\mathrm{var}(X)}\,x + E_Y[Y] - \frac{\mathrm{cov}(X,Y)}{\mathrm{var}(X)}\,E_X[X]
\end{aligned}$$

or finally

$$\hat{Y} = E_Y[Y] + \frac{\mathrm{cov}(X,Y)}{\mathrm{var}(X)}\,(x - E_X[X]). \qquad (7.41)$$


Note that we refer to $\hat{Y} = aX + b$ as a predictor but $\hat{Y} = ax + b$ as the prediction, which is the value of the predictor. As expected, the prediction of Y based on X = x depends on the covariance. In fact, if the covariance is zero, then $\hat{Y} = E_Y[Y]$, which is the best linear predictor of Y without knowledge of the outcome of X. In this case, X provides no information about Y. An example follows.

Example 7.13 - Predicting one random variable outcome from knowledge of second random variable outcome

Consider the joint PMF shown in Figure 7.9a as solid circles where all the outcomes are equally probable. Then, $S_{X,Y} = \{(0,0), (1,1), (2,2), (2,3)\}$ and the marginals are found by summing along each direction to yield

$$p_X[i] = \begin{cases} 1/4 & i = 0 \\ 1/4 & i = 1 \\ 1/2 & i = 2 \end{cases} \qquad p_Y[j] = \begin{cases} 1/4 & j = 0 \\ 1/4 & j = 1 \\ 1/4 & j = 2 \\ 1/4 & j = 3. \end{cases}$$

Figure 7.9: Joint PMF (shown as solid circles having equal probabilities) and best linear prediction of Y when X = x is observed (shown as dashed line). (a) Nonstandardized X and Y. (b) Standardized X and Y.

As a result, we have from the marginals that $E_X[X] = 5/4$, $E_Y[Y] = 3/2$, $E_X[X^2] = 9/4$, and $\mathrm{var}(X) = E_X[X^2] - E_X^2[X] = 9/4 - (5/4)^2 = 11/16$. From the joint PMF we find that $E_{X,Y}[XY] = (0)(0)1/4 + (1)(1)1/4 + (2)(2)1/4 + (2)(3)1/4 = 11/4$, which results in $\mathrm{cov}(X,Y) = E_{X,Y}[XY] - E_X[X]E_Y[Y] = 11/4 - (5/4)(3/2) = 7/8$. Substituting these values into (7.41) yields the best linear prediction of Y as

$$\begin{aligned}
\hat{Y} &= \frac{3}{2} + \frac{7/8}{11/16}\left(x - \frac{5}{4}\right) \\
&= \frac{14}{11}\,x - \frac{1}{11}
\end{aligned}$$

which is shown in Figure 7.9a as the dashed line. The line shown in Figure 7.9a is referred to as a regression line in statistics. What do you think would happen if the probability of (2, 3) were zero, and the remaining three points had probabilities of 1/3?

❖
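The calculation in Example 7.13 can be reproduced with exact arithmetic. The sketch below uses hypothetical helpers `expect`, `linear_predictor`, and `mse` to compute $a_{\mathrm{opt}}$ and $b_{\mathrm{opt}}$ from the moments in (7.41) and to evaluate the mean square error (7.40) at the optimum and at a nearby point.

```python
from fractions import Fraction as F

def expect(joint, g):
    """E_{X,Y}[g(X,Y)] from (7.29)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def linear_predictor(joint):
    """Return (a_opt, b_opt) = (cov/var(X), E[Y] - a_opt E[X])."""
    ex = expect(joint, lambda x, y: x)
    ey = expect(joint, lambda x, y: y)
    exy = expect(joint, lambda x, y: x * y)
    ex2 = expect(joint, lambda x, y: x * x)
    a = (exy - ex * ey) / (ex2 - ex**2)
    return a, ey - a * ex

def mse(joint, a, b):
    """Mean square error (7.40) of the predictor aX + b."""
    return expect(joint, lambda x, y: (y - a * x - b) ** 2)

# Four equally probable outcomes of Example 7.13
joint = {(0, 0): F(1, 4), (1, 1): F(1, 4), (2, 2): F(1, 4), (2, 3): F(1, 4)}

a_opt, b_opt = linear_predictor(joint)
print("a_opt =", a_opt, " b_opt =", b_opt)
print("mse at optimum:", mse(joint, a_opt, b_opt))
print("mse at a = 1, b = 0:", mse(joint, 1, 0))
```

Comparing the two MSE values gives a simple sanity check that the stationary point is indeed a minimum for this PMF; the predictor Y-hat = X, which is exact on three of the four points, still has a larger MSE than the regression line.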

The reader should be aware that we could also have predicted X from Y = y by interchanging X and Y in (7.41). Also, we note that if cov(X, Y) = 0, then $\hat{Y} = E_Y[Y]$, or X = x provides no information to help us predict Y. Clearly, this will be the case if X and Y are independent (see (7.37)) since independence of two random variables implies a covariance of zero. However, even if the covariance is zero, the random variables can still be dependent (see Figure 7.8) and so prediction should be possible. This apparent paradox is explained by the fact that in this case we must use a nonlinear predictor, not the simple linear function aX + b (see Problem 8.27).

The optimal linear prediction of (7.41) can also be expressed in standardized form. A standardized random variable is defined to be one for which the mean is zero and the variance is one. An example would be a random variable that takes on the values ±1 with equal probability. Any random variable can be standardized by subtracting the mean and dividing the result by the square root of the variance to form

$$X_s = \frac{X - E_X[X]}{\sqrt{\mathrm{var}(X)}}$$

(see Problem 7.42). For example, if $X \sim \mathrm{Pois}(\lambda)$, then $X_s = (X - \lambda)/\sqrt{\lambda}$, which is easily shown to have a mean of zero and a variance of one. We next seek the best linear prediction of the standardized Y based on a standardized X = x. To do so we define the standardized predictor based on a standardized $X_s = x_s$ as

$$\hat{Y}_s = \frac{\hat{Y} - E_Y[Y]}{\sqrt{\mathrm{var}(Y)}}.$$

Then from (7.41), we have

$$\frac{\hat{Y} - E_Y[Y]}{\sqrt{\mathrm{var}(Y)}} = \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(Y)\,\mathrm{var}(X)}}\;\frac{x - E_X[X]}{\sqrt{\mathrm{var}(X)}}$$

and therefore

$$\hat{Y}_s = \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}\,x_s. \qquad (7.42)$$

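Example 7.14 - Previous example continued

For the previous example we have that

$$x_s = \frac{x - 5/4}{\sqrt{11/16}}$$

and

$$\hat{Y}_s = \frac{\hat{Y} - 3/2}{\sqrt{5/4}}$$

so that, since

$$\frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} = \frac{7/8}{\sqrt{(11/16)(5/4)}} \approx 0.94,$$

we have

$$\hat{Y}_s = 0.94\,x_s$$

and is displayed in Figure 7.9b.

❖

The factor that scales $x_s$ to produce $\hat{Y}_s$ is denoted by

$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} \qquad (7.43)$$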

and is called the correlation coefficient. When X and Y have $\rho_{X,Y} \neq 0$, then X and Y are said to be correlated. If, however, the covariance is zero and hence $\rho_{X,Y} = 0$, then the random variables are said to be uncorrelated. Clearly, independent random variables are always uncorrelated, but not the other way around. Using the correlation coefficient allows us to express the best linear prediction in its standardized form as $\hat{Y}_s = \rho_{X,Y}\,x_s$. The correlation coefficient has an important property in that its magnitude never exceeds one. In the previous example, we had $\rho_{X,Y} \approx 0.94$.
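The definition (7.43) translates directly into code. The sketch below, using the hypothetical helpers `expect` and `corr_coeff`, evaluates the correlation coefficient for the joint PMF of Example 7.13 and for the two perfectly predictable cases described for Figures 7.6a and 7.6b.

```python
import math
from fractions import Fraction as F

def expect(joint, g):
    """E_{X,Y}[g(X,Y)] from (7.29)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def corr_coeff(joint):
    """rho_{X,Y} = cov(X,Y) / sqrt(var(X) var(Y)) as in (7.43)."""
    ex = expect(joint, lambda x, y: x)
    ey = expect(joint, lambda x, y: y)
    c = expect(joint, lambda x, y: (x - ex) * (y - ey))
    vx = expect(joint, lambda x, y: (x - ex) ** 2)
    vy = expect(joint, lambda x, y: (y - ey) ** 2)
    return float(c) / math.sqrt(float(vx * vy))

example_7_13 = {(0, 0): F(1, 4), (1, 1): F(1, 4), (2, 2): F(1, 4), (2, 3): F(1, 4)}
figure_7_6a = {(-1, -1): F(1, 2), (1, 1): F(1, 2)}   # Y = X
figure_7_6b = {(1, -1): F(1, 2), (-1, 1): F(1, 2)}   # Y = -X

rho_13 = corr_coeff(example_7_13)
rho_a = corr_coeff(figure_7_6a)
rho_b = corr_coeff(figure_7_6b)
print(round(rho_13, 2), rho_a, rho_b)
```

The function assumes both variances are nonzero; a degenerate random variable would require separate handling.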


Property 7.7 - Correlation coefficient is always less than or equal to one
in magnitude or $|\rho_{X,Y}| \le 1$.

Proof: The proof relies on the Cauchy-Schwarz inequality for random variables.
This inequality is analogous to the usual one for the dot product of Euclidean vectors
$\mathbf{v}$ and $\mathbf{w}$, which is

\[
|\mathbf{v} \cdot \mathbf{w}| \le \|\mathbf{v}\| \, \|\mathbf{w}\|
\]

where $\|\mathbf{v}\|$ denotes the length of the vector. Equality holds if and only if the vectors
are collinear. Collinearity means that $\mathbf{w} = c\mathbf{v}$ for $c$ a constant, that is, the vectors
point in the same or in opposite directions. For random variables $V$ and $W$ the
Cauchy-Schwarz inequality says that

\[
|E_{V,W}[VW]| \le \sqrt{E_V[V^2]} \sqrt{E_W[W^2]} \tag{7.44}
\]

with equality if and only if $W = cV$ for $c$ a constant. See Appendix 7A for a
derivation. Thus letting $V = X - E_X[X]$ and $W = Y - E_Y[Y]$, we have


\[
|\rho_{X,Y}| = \frac{|\mathrm{cov}(X,Y)|}{\sqrt{\mathrm{var}(X)\mathrm{var}(Y)}} = \frac{|E_{V,W}[VW]|}{\sqrt{E_V[V^2]}\sqrt{E_W[W^2]}} \le 1
\]

using (7.44). Equality will hold if and only if $W = cV$ or equivalently if
$Y - E_Y[Y] = c(X - E_X[X])$, which is easily shown to imply that (see Problem 7.45)

\[
\rho_{X,Y} = \begin{cases} 1 & \text{if } Y = aX + b \text{ with } a > 0 \\ -1 & \text{if } Y = aX + b \text{ with } a < 0 \end{cases}
\]

for $a$ and $b$ constants.

□

Note that when $\rho_{X,Y} = \pm 1$, $Y$ can be perfectly predicted from $X$ by using $Y =
aX + b$. See also Figures 7.6a and 7.6b for examples of when $\rho_{X,Y} = +1$ and
$\rho_{X,Y} = -1$, respectively.
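Property 7.7 and its equality condition can be checked numerically. The Python sketch below computes the correlation coefficient directly from a joint PMF stored as a dictionary. The PMF values and the constants in Y = aX + b are hypothetical and chosen only for illustration.

```python
from math import sqrt

def correlation_coefficient(joint_pmf):
    """joint_pmf maps (x, y) -> probability; returns rho_{X,Y} as in (7.43)."""
    ex = sum(p * x for (x, y), p in joint_pmf.items())
    ey = sum(p * y for (x, y), p in joint_pmf.items())
    var_x = sum(p * (x - ex) ** 2 for (x, y), p in joint_pmf.items())
    var_y = sum(p * (y - ey) ** 2 for (x, y), p in joint_pmf.items())
    cov = sum(p * (x - ex) * (y - ey) for (x, y), p in joint_pmf.items())
    return cov / sqrt(var_x * var_y)

# Hypothetical PMFs (not from the text).
px = {0: 0.2, 1: 0.5, 2: 0.3}
py = {0: 0.5, 1: 0.5}

# Y = aX + b with a > 0 and with a < 0: all probability lies on a line.
rho_pos = correlation_coefficient({(x, 3 * x + 1): p for x, p in px.items()})
rho_neg = correlation_coefficient({(x, -2 * x + 5): p for x, p in px.items()})

# Independent X and Y: the joint PMF factors, so the covariance is zero.
rho_ind = correlation_coefficient(
    {(x, y): p * q for x, p in px.items() for y, q in py.items()})
```

The two linear cases give +1 and -1 up to round-off, and the independent case gives zero, in line with the statement that independent random variables are uncorrelated.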


Correlation between random variables does not imply a causal
relationship between the random variables.

A frequent misapplication of probability is to assert that two quantities that are
correlated ($\rho_{X,Y} \ne 0$) are such because one causes the other. To dispel this myth
consider a survey in which all individuals older than 55 years of age in the U.S. are
asked whether they have ever had prostate cancer and also their height in inches.
Then, for each height in inches we compute the average number of individuals per
1000 who have had cancer. If we plot the average number, also called the incidence
of cancer, versus height, a typical result would be as shown in Figure 7.10. This
indicates a strong positive correlation of cancer with height. One might be tempted
to conclude that growing taller causes prostate cancer. This is of course nonsense.
What is actually shown is that segments of the population who are tall are associated
with a higher incidence of cancer. This is because the portion of the population of
individuals who are taller than the rest are predominantly male. Females are not
subject to prostate cancer, as they have no prostates! In summary, correlation
between two variables only indicates an association, i.e., if one increases, then so
does the other (if positively correlated). No physical or causal relationship need
exist.

Figure 7.10: Incidence of prostate cancer per 1000 individuals older than age 55
versus height (horizontal axis: height in inches).


7.10  Joint  Characteristic  Functions 

The characteristic function of a discrete random variable was introduced in Section
6.7. For two random variables we can define a joint characteristic function. For the
random variables $X$ and $Y$ it is defined as

\[
\phi_{X,Y}(\omega_X,\omega_Y) = E_{X,Y}[\exp[j(\omega_X X + \omega_Y Y)]]. \tag{7.45}
\]

Assuming both random variables take on integer values, it is evaluated using (7.29)
as

\[
\phi_{X,Y}(\omega_X,\omega_Y) = \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} p_{X,Y}[k,l] \exp[j(\omega_X k + \omega_Y l)]. \tag{7.46}
\]

It is seen to be the two-dimensional Fourier transform of the two-dimensional
sequence $p_{X,Y}[k,l]$ (note the use of $+j$ as opposed to the more common $-j$ in the
exponential). As in the case of a single random variable, the characteristic function
can be used to find moments. In this case, the joint moments are given by the
formula

\[
E_{X,Y}[X^m Y^n] = \frac{1}{j^{m+n}} \left. \frac{\partial^{m+n} \phi_{X,Y}(\omega_X,\omega_Y)}{\partial\omega_X^m \, \partial\omega_Y^n} \right|_{\omega_X=\omega_Y=0}. \tag{7.47}
\]

In particular, the first joint moment is found as

\[
E_{X,Y}[XY] = - \left. \frac{\partial^2 \phi_{X,Y}(\omega_X,\omega_Y)}{\partial\omega_X \, \partial\omega_Y} \right|_{\omega_X=\omega_Y=0}.
\]
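The moment formula can be tried out numerically. The Python sketch below evaluates (7.46) for a PMF with finitely many points and approximates the mixed partial derivative at the origin with a central difference. Note that the finite difference is our substitute for the analytical derivative, the step size h is an arbitrary choice, and the PMF is borrowed from Table 7.7 of Example 7.15.

```python
import cmath

# Joint PMF of Table 7.7 (Example 7.15), keyed by (k, l).
pmf = {(0, 0): 1/8, (0, 1): 1/8, (1, 0): 1/4, (1, 1): 1/2}

def joint_cf(pmf, wx, wy):
    """Evaluate (7.46) for a PMF with finitely many integer points."""
    return sum(p * cmath.exp(1j * (wx * k + wy * l)) for (k, l), p in pmf.items())

def first_joint_moment(pmf, h=1e-3):
    """Approximate E[XY] = -d^2 phi / (d wx d wy) at wx = wy = 0."""
    mixed = (joint_cf(pmf, h, h) - joint_cf(pmf, h, -h)
             - joint_cf(pmf, -h, h) + joint_cf(pmf, -h, -h)) / (4 * h * h)
    return (-mixed).real

exy_cf = first_joint_moment(pmf)                             # via the characteristic function
exy_direct = sum(p * k * l for (k, l), p in pmf.items())     # direct sum over the PMF
```

The two values agree to within the error of the finite difference, which shrinks as h is reduced until round-off takes over.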


Another important application is to finding the PMF for the sum of two independent
random variables. This application is based on the result that if $X$ and $Y$ are
independent random variables, the joint characteristic function factors due to the
property $E_{X,Y}[g(X)h(Y)] = E_X[g(X)]E_Y[h(Y)]$ (see (7.31)). Before deriving the
PMF for the sum of two independent random variables, we prove the factorization
result, and then give a theoretical application. The factorization of the characteristic
function follows as

\begin{align*}
\phi_{X,Y}(\omega_X,\omega_Y) &= \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} p_{X,Y}[k,l] \exp[j(\omega_X k + \omega_Y l)] \\
&= \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} p_X[k]p_Y[l] \exp[j\omega_X k] \exp[j\omega_Y l] && \text{(joint PMF factors)} \\
&= \sum_{k=-\infty}^{\infty} p_X[k]\exp[j\omega_X k] \sum_{l=-\infty}^{\infty} p_Y[l] \exp[j\omega_Y l] \\
&= \phi_X(\omega_X)\phi_Y(\omega_Y) && \text{(definition of characteristic function for single random variable).} \tag{7.48}
\end{align*}


The converse is also true: if the joint characteristic function factors, then $X$ and
$Y$ are independent random variables. This can easily be shown to follow from the
inverse Fourier transform relationship. As an application of the converse result,
consider the transformed random variables $W = g(X)$ and $Z = h(Y)$, where $X$ and
$Y$ are independent. We prove that $W$ and $Z$ are independent as well, which is to
say functions of independent random variables are independent. To do so we show
that the joint characteristic function factors. The joint characteristic function of the
transformed random variables is

\[
\phi_{W,Z}(\omega_W,\omega_Z) = E_{W,Z}[\exp[j(\omega_W W + \omega_Z Z)]].
\]

But we have that

\begin{align*}
\phi_{W,Z}(\omega_W,\omega_Z) &= E_{X,Y}[\exp[j(\omega_W g(X) + \omega_Z h(Y))]] && \text{(slight extension of (7.28))} \\
&= E_X[\exp(j\omega_W g(X))] E_Y[\exp(j\omega_Z h(Y))] && \text{(same argument as used to yield (7.31))} \\
&= E_W[\exp(j\omega_W W)] E_Z[\exp(j\omega_Z Z)] && \text{(from (6.5))} \\
&= \phi_W(\omega_W) \phi_Z(\omega_Z) && \text{(definition)}
\end{align*}

and hence $W$ and $Z$ are independent random variables. As a general result, we can
now assert that if $X$ and $Y$ are independent random variables, then so are $g(X)$
and $h(Y)$ for any functions $g$ and $h$.
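The general result can be illustrated on a small case. The Python sketch below builds the joint PMF of W = g(X) and Z = h(Y) from a product PMF and checks that it factors into its own marginals. The marginal PMFs and the functions g and h are hypothetical choices, and the check is done on the PMF directly rather than through characteristic functions.

```python
from fractions import Fraction
from collections import defaultdict

# Hypothetical marginal PMFs (not from the text); X and Y are taken as independent.
px = {-1: Fraction(1, 4), 0: Fraction(1, 4), 1: Fraction(1, 2)}
py = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}

def g(x):
    return x * x          # W = g(X), a many-to-one function

def h(y):
    return y % 2          # Z = h(Y), also many-to-one

# Joint PMF of (W, Z): add p_X[x] p_Y[y] over all (x, y) mapped to each (w, z).
p_wz = defaultdict(Fraction)
for x, p in px.items():
    for y, q in py.items():
        p_wz[(g(x), h(y))] += p * q

# Marginal PMFs of W and Z obtained from the joint PMF.
p_w, p_z = defaultdict(Fraction), defaultdict(Fraction)
for (w, z), p in p_wz.items():
    p_w[w] += p
    p_z[z] += p

# Independence of W and Z: the joint PMF equals the product of the marginals.
factors = all(p_wz[(w, z)] == p_w[w] * p_z[z] for w in p_w for z in p_z)
```

Exact fractions are used so that the comparison is not affected by round-off.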

Finally, consider the problem of determining the PMF for $Z = X + Y$, where $X$
and $Y$ are independent random variables. We have already solved this problem using
the joint PMF approach with the final result given by (7.22). By using characteristic
functions, we can simplify the derivation. The derivation proceeds as follows.

\begin{align*}
\phi_Z(\omega_Z) &= E_Z[\exp(j\omega_Z Z)] && \text{(definition)} \\
&= E_{X,Y}[\exp(j\omega_Z (X + Y))] && \text{(from (7.28) and (7.29))} \\
&= E_{X,Y}[\exp(j\omega_Z X) \exp(j\omega_Z Y)] \\
&= E_X[\exp(j\omega_Z X)] E_Y[\exp(j\omega_Z Y)] && \text{(from (7.31))} \\
&= \phi_X(\omega_Z) \phi_Y(\omega_Z).
\end{align*}

To find the PMF we take the inverse Fourier transform of $\phi_Z(\omega_Z)$, replacing $\omega_Z$ by
the more usual notation $\omega$, to yield

\begin{align*}
p_Z[k] &= \int_{-\pi}^{\pi} \phi_X(\omega)\phi_Y(\omega) \exp(-j\omega k) \, \frac{d\omega}{2\pi} \\
&= \sum_{i=-\infty}^{\infty} p_X[i] p_Y[k-i]
\end{align*}

which agrees with (7.22). The last result follows from the property that the Fourier
transform of a convolution sum is the product of the Fourier transforms of the
individual sequences.
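For PMFs with finitely many nonnegative integer values the derivation above can be mimicked on a computer. In the Python sketch below the inverse Fourier integral is replaced by a sum over n equally spaced frequencies, which is exact when n is at least the number of possible values of Z. That substitution is ours and is not part of the derivation in the text. The Bernoulli PMFs are those of Problem 7.27.

```python
import cmath

def char_fn(pmf, omega):
    """phi(omega) = sum_k p[k] exp(j omega k) for a PMF on k = 0, ..., len(pmf)-1."""
    return sum(p * cmath.exp(1j * omega * k) for k, p in enumerate(pmf))

def pmf_of_sum_cf(px, py):
    """PMF of Z = X + Y from the product of the characteristic functions."""
    n = len(px) + len(py) - 1                      # number of possible values of Z
    omegas = [2 * cmath.pi * m / n for m in range(n)]
    phi_z = [char_fn(px, w) * char_fn(py, w) for w in omegas]
    # Inverse transform: the integral over one period becomes an n-point sum.
    return [sum(phi * cmath.exp(-1j * w * k) for phi, w in zip(phi_z, omegas)).real / n
            for k in range(n)]

def pmf_of_sum_conv(px, py):
    """Direct evaluation of the convolution sum (7.22)."""
    pz = [0.0] * (len(px) + len(py) - 1)
    for i, p in enumerate(px):
        for l, q in enumerate(py):
            pz[i + l] += p * q
    return pz

px = [0.5, 0.5]          # X ~ Ber(1/2)
py = [0.5, 0.5]          # Y ~ Ber(1/2)
pz_cf = pmf_of_sum_cf(px, py)
pz_conv = pmf_of_sum_conv(px, py)
```

Both routes should give the same PMF for Z up to round-off. For longer sequences a fast Fourier transform would normally replace the explicit sums.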


7.11  Computer  Simulation  of  Random  Vectors 

The method of generating realizations of a two-dimensional discrete random vector
is nearly identical to the one-dimensional case. In fact, if $X$ and $Y$ are independent,
then we generate a realization of $X$, say $x_i$, according to $p_X[x_i]$ and a realization of $Y$,
say $y_j$, according to $p_Y[y_j]$ using the method of Chapter 5. Then we concatenate the
realizations together to form the realization of the vector random variable as $(x_i, y_j)$.
Furthermore, independence reduces the problems of estimating a joint PMF, a joint
CDF, etc. to that of the one-dimensional case. The joint PMF, for example, can be
estimated by first estimating $p_X[x_i]$ as $\hat{p}_X[x_i]$, then estimating $p_Y[y_j]$ as $\hat{p}_Y[y_j]$, and
finally forming the estimate of the joint PMF as $\hat{p}_{X,Y}[x_i,y_j] = \hat{p}_X[x_i]\hat{p}_Y[y_j]$.

When the random variables are not independent, we need to generate a realization
of $(X,Y)$ simultaneously since the value obtained for $X$ is dependent on the
value obtained for $Y$ and vice versa. If both $\mathcal{S}_X$ and $\mathcal{S}_Y$ are finite, then a simple
procedure is to consider each possible realization $(x_i,y_j)$ as a single outcome with
probability $p_{X,Y}[x_i,y_j]$. Then, we can apply the techniques of Section 5.9 directly.
An example is given next.

Example  7.15  -  Generating  realizations  of  jointly  distributed  random 
variables 

Assume a joint PMF as given in Table 7.7.

          j = 0    j = 1
i = 0      1/8      1/8
i = 1      1/4      1/2

Table 7.7: Joint PMF values for Example 7.15.

A simple MATLAB program to generate a set of M realizations of (X, Y) is given below.


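% M, the number of realizations, must be defined before this loop
for m=1:M
   u=rand(1,1);               % uniform random number on (0,1)
   if u<=1/8                  % probability 1/8: (x,y)=(0,0)
      x(m,1)=0;y(m,1)=0;
   elseif u>1/8&u<=1/4        % probability 1/8: (x,y)=(0,1)
      x(m,1)=0;y(m,1)=1;
   elseif u>1/4&u<=1/2        % probability 1/4: (x,y)=(1,0)
      x(m,1)=1;y(m,1)=0;
   else                       % probability 1/2: (x,y)=(1,1)
      x(m,1)=1;y(m,1)=1;
   end
end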


Once the realizations are available we can estimate the joint PMF and marginal
PMFs as

\begin{align*}
\hat{p}_{X,Y}[i,j] &= \frac{\text{Number of outcomes equal to } (i,j)}{M} && i = 0,1;\ j = 0,1 \\
\hat{p}_X[i] &= \hat{p}_{X,Y}[i,0] + \hat{p}_{X,Y}[i,1] && i = 0,1 \\
\hat{p}_Y[j] &= \hat{p}_{X,Y}[0,j] + \hat{p}_{X,Y}[1,j] && j = 0,1
\end{align*}

and the joint moments are estimated as

\[
\widehat{E_{X,Y}[X^k Y^l]} = \frac{1}{M} \sum_{m=1}^{M} x_m^k y_m^l
\]

where $(x_m, y_m)$ is the $m$th realization. Other quantities of interest are discussed in
Problems 7.49 and 7.51.

❖ 
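The same procedure, together with the estimators above, can be sketched in Python for readers without MATLAB. The code below treats each pair (i, j) of Table 7.7 as a single outcome and selects it by comparing a uniform random number with the running cumulative probability. The function names, the seed, and the number of realizations are our own choices.

```python
import random
from collections import Counter

# Joint PMF of Table 7.7, keyed by (i, j).
PMF = {(0, 0): 1/8, (0, 1): 1/8, (1, 0): 1/4, (1, 1): 1/2}

def generate(pmf, num, rng):
    """Draw num realizations by treating each (x, y) pair as a single outcome."""
    outcomes = list(pmf)
    realizations = []
    for _ in range(num):
        u, cumulative = rng.random(), 0.0
        chosen = outcomes[-1]                  # fallback guards against round-off
        for outcome in outcomes:
            cumulative += pmf[outcome]
            if u <= cumulative:
                chosen = outcome
                break
        realizations.append(chosen)
    return realizations

def estimate(realizations):
    """Return the estimated joint PMF, marginal PMFs, and E[XY]."""
    m = len(realizations)
    joint = {k: c / m for k, c in Counter(realizations).items()}
    p_x = {i: joint.get((i, 0), 0.0) + joint.get((i, 1), 0.0) for i in (0, 1)}
    p_y = {j: joint.get((0, j), 0.0) + joint.get((1, j), 0.0) for j in (0, 1)}
    exy = sum(x * y for x, y in realizations) / m
    return joint, p_x, p_y, exy

rng = random.Random(0)    # fixed seed (an arbitrary choice) for repeatability
joint_hat, px_hat, py_hat, exy_hat = estimate(generate(PMF, 100_000, rng))
```

With this many realizations the estimates should lie close to the true values 1/8, 1/8, 1/4, 1/2 for the joint PMF and 1/2 for E[XY].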


7.12  Real-World  Example  -  Assessing  Health  Risks 


An increasingly common health problem in the United States is obesity. It has been
found to be associated with many life-threatening illnesses, especially diabetes. One
way to define what constitutes an obese person is via the body mass index (BMI)
[CDC 2003]. It is computed as

\[
\mathrm{BMI} = \frac{703 W}{H^2} \tag{7.49}
\]

where $W$ is the weight of the person in pounds and $H$ is the person's height in inches.
BMIs greater than 25 and less than 30 are considered to indicate an overweight
person, and 30 and above an obese person [CDC 2003]. It is of great importance to
be able to estimate the PMF of the BMI for a population of people. For example,
in Chapter 4 we displayed a table of the joint probabilities of heights and weights
for a hypothetical population of college students. For this population we would
like to know the probability or percentage of obese persons. This percentage of the
population would then be at risk for developing diabetes. To do so we could first
determine the PMF of the BMI and then determine the probability of a BMI of 30
and above. From Table 4.1 or Figure 7.1 we have the joint PMF for the random
vector $(H, W)$. To find the PMF for the BMI we note that it is a function of $H$ and
$W$ or in our previous notation, we wish to determine the PMF of $Z = g(X, Y)$, where
$Z$ denotes the BMI, $X$ denotes the height, and $Y$ denotes the weight. The solution
follows immediately from (7.26). One slight modification that we must make in
order to fit the data of Table 4.1 into our theoretical framework is to replace the
height and weight intervals by their midpoint values. For example, in Table 4.1 the
probability of observing a person with a height between 5'8" and 6' and a weight of
between 130 and 160 lbs. is 0.06. We convert these intervals so that we can say that




the  probability  of  a  person  having  a  height  of  5'  10"  and  a  weight  of  145  lbs.  is  0.06. 
Next  to  determine  the  PMF  we  first  find  the  BMI  for  each  height  and  weight  using 
(7.49),  rounding  the  result  to  the  nearest  integer.  This  is  displayed  in  Table  7.8. 


               W1     W2     W3     W4     W5
               115    145    175    205    235
H1  5' 2"      21     27     32     37     43
H2  5' 6"      19     23     28     33     38
H3  5' 10"     16     21     25     29     34
H4  6' 2"      15     19     22     26     30
H5  6' 6"      13     17     20     24     27

Table 7.8: Body mass indexes for heights and weights of hypothetical college
students.
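The entries of Table 7.8 follow directly from (7.49). As a check, the short Python sketch below recomputes them from the midpoint heights (converted to inches) and weights. The variable names are ours.

```python
# Midpoint heights (inches) and weights (pounds) used in Table 7.8.
HEIGHTS_IN = [62, 66, 70, 74, 78]        # 5'2", 5'6", 5'10", 6'2", 6'6"
WEIGHTS_LB = [115, 145, 175, 205, 235]

def bmi(weight_lb, height_in):
    """Body mass index from (7.49)."""
    return 703 * weight_lb / height_in ** 2

# One row per height, one column per weight, rounded to the nearest integer.
bmi_table = [[round(bmi(w, h)) for w in WEIGHTS_LB] for h in HEIGHTS_IN]

# Height/weight pairs whose rounded BMI is 30 or more (the obese range).
obese_pairs = [(h, w) for h in HEIGHTS_IN for w in WEIGHTS_LB
               if round(bmi(w, h)) >= 30]
```

The list `obese_pairs` identifies which cells of Table 4.1 contribute to the probability of a BMI of 30 and above.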


Figure 7.11: Probability mass function for body mass index of hypothetical college
population (horizontal axis: BMI).

Then, we determine the PMF by using (7.26). For example, for a BMI = 21, we
require from Table 7.8 the entries (H, W) = (5'2", 115) and (H, W) = (5'10", 145).
But from Table 4.1 we see that

P[H = 5'2", W = 115] = 0.08
P[H = 5'10", W = 145] = 0.06

and therefore P[BMI = 21] = 0.14. The other values of the PMF of the BMI
are found similarly. This produces the PMF shown in Figure 7.11. It is seen that
the probability of being obese as defined by the BMI (BMI ≥ 30) is 0.08. Stated
another way, 8% of the population of college students are obese and so are at risk
for diabetes.
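The bookkeeping behind this calculation is simply the addition of all joint probabilities that map to the same BMI value. A minimal Python sketch follows. Since Table 4.1 is not reproduced in this section, only the two entries quoted above are used, so the joint PMF in the code is deliberately partial. The function names are ours.

```python
from collections import defaultdict

def pmf_of_function(joint_pmf, g):
    """Add the joint probabilities of all (x, y) that g maps to the same value z."""
    p_z = defaultdict(float)
    for (x, y), p in joint_pmf.items():
        p_z[g(x, y)] += p
    return dict(p_z)

def rounded_bmi(height_in, weight_lb):
    """BMI from (7.49), rounded to the nearest integer."""
    return round(703 * weight_lb / height_in ** 2)

# Only the two Table 4.1 entries quoted in the text, keyed by (height in inches,
# weight in pounds); the remaining entries of the table are not included.
partial_joint = {(62, 115): 0.08, (70, 145): 0.06}
p_bmi = pmf_of_function(partial_joint, rounded_bmi)
```

Supplying the complete Table 4.1 in place of `partial_joint` would give the full PMF of the BMI.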
Problems

7.1 (w) A chess piece is placed on a chessboard, which consists of an 8 x 8 array
of 64 squares. Specify a numerical sample space $\mathcal{S}_{X,Y}$ for the location of the
chess piece.

7.2 (w) Two coins are tossed in succession with a head being mapped into a +1 and
a tail being mapped into a -1. If a random vector is defined as (X, Y) with
X representing the mapping of the first toss and Y representing the mapping
of the second toss, draw the mapping. Use Figure 7.2 as a guide. Also, what
is $\mathcal{S}_{X,Y}$?

7.3 (^) (w) A woman has a penny, a nickel, and a dime in her pocket. If she
chooses two coins from her pocket in succession, what is the sample space $\mathcal{S}$
of possible outcomes? If these outcomes are next mapped into the values of
the coins, what is the numerical sample space $\mathcal{S}_{X,Y}$?

7.4 (w) If $\mathcal{S}_X = \{1, 2\}$ and $\mathcal{S}_Y = \{3, 4\}$, plot the points in the plane comprising
$\mathcal{S}_{X,Y} = \mathcal{S}_X \times \mathcal{S}_Y$. What is the size of $\mathcal{S}_{X,Y}$?

7.5 (w) Two dice are tossed. The number of dots observed on the dice are added
together to form the random variable X and also differenced to form Y.
Determine the possible outcomes of the random vector (X, Y) and plot them in
the plane. How many possible outcomes are there?

7.6 (f) A two-dimensional sequence is given by

\[
p_{X,Y}[i,j] = c(1 - p_1)^i (1 - p_2)^j \qquad i = 1,2,\ldots;\ j = 1,2,\ldots
\]

where $0 < p_1 < 1$, $0 < p_2 < 1$, and $c$ is a constant. Find $c$ to make $p_{X,Y}$ a
valid joint PMF.
7.7 (f) Is

Px,y[i,j] 

a  valid  joint  PMF? 

7.8 (^) (w) A single coin is tossed twice. A head outcome is mapped into a 1 and
a tail outcome into a 0 to yield a numerical outcome. Next, a random vector
(X, Y) is defined as

X = outcome of first toss + outcome of second toss
Y = outcome of first toss - outcome of second toss.

Find the joint PMF for (X, Y), assuming the outcomes $(x_i, y_j)$ are equally
likely.

7.9 (f) Find the joint PMF for the experiment described in Example 7.1. Assume
each outcome in $\mathcal{S}$ is equally likely. How can you check your answer?

7.10 (0) (f) The sample space for a random vector is $\mathcal{S}_{X,Y} = \{(i,j) : i = 1,2,3,4,5;\ 
j = 1,2,3,4\}$. If the outcomes are equally likely, find $P[(X, Y) \in A]$, where
$A = \{(i,j) : 1 \le i \le 2;\ 3 \le j \le 4\}$.

7.11 (f) A joint PMF is given as $p_{X,Y}[i,j] = (1/2)^{i+j}$ for $i = 1,2,\ldots$; $j = 1,2,\ldots$.
If $A = \{(i,j) : 1 \le i \le 3;\ j \ge 2\}$, find $P[A]$.

7.12 (f) The values of a joint PMF are given in Table 7.9. Determine the marginal
PMFs.

          j = 0    j = 1    j = 2
i = 0      1/8      0        1/4
i = 1      0        1/8      1/4
i = 2      1/8      0        1/8

Table 7.9: Joint PMF values for Problem 7.12.


7.13 (o) (f) If a joint PMF is given by

\[
p_{X,Y}[i,j] = p^2 (1 - p)^{i+j-2} \qquad i = 1,2,\ldots;\ j = 1,2,\ldots
\]

find the marginal PMFs.

7.14 (f) If a joint PMF is given by $p_{X,Y}[i,j] = 1/36$ for $i = 1,2,3,4,5,6$; $j =
1,2,3,4,5,6$, find the marginal PMFs.
7.15 (w) A joint PMF is given by


Px,r[i,j ] 


where  c  is  some  unknown  constant.  Find  c  so  that  the  joint  PMF  is  valid  and 
then  determine  the  marginal  PMFs.  Hint:  Recall  the  binomial  PMF. 


7.16  (o)  (w)  Find  another  set  of  values  for  the  joint  PMF  that  will  yield  the  same 
marginal  PMFs  as  given  in  Table  7.2. 


7.17  (t)  Prove  Properties  7.3  and  7.4  for  the  joint  CDF  by  relying  on  the  standard 
properties  of  probabilities  of  events. 


7.18  (w)  Sketch  the  joint  CDF  for  the  joint  PMF  given  in  Table  7.2.  Do  this  by 
shading  each  region  in  the  x-y  plane  that  has  the  same  value. 


7.19  (o)  (w)  A  joint  PMF  is  given  by 


3  (i,j)  =  (0,0) 

I  (*\j)  =  (  1, 

3  (*,;)  =  (1,0) 

l  (i,j)  =  (  1,-1) 

Are  X  and  Y  independent? 

7.20  (t)  Prove  that  if  the  random  variables  X  and  Y  are  independent,  then  the 
joint  CDF  factors  as  Fx,y(x,y)  =  Fx(x)Fy(y ). 

7.21 (t) If a joint PMF is given by

\[
p_{X,Y}[i,j] = \begin{cases} a & (i,j) = (0,0) \\ b & (i,j) = (0,1) \\ c & (i,j) = (1,0) \\ d & (i,j) = (1,1) \end{cases}
\]

where of course we must have $a + b + c + d = 1$, show that a necessary condition
for the random variables to be independent is $ad = bc$. This can be used to
quickly assert that the random variables are not independent as for the case
shown in Table 7.5.

7.22 (f) If $X \sim \mathrm{Ber}(p_X)$ and $Y \sim \mathrm{Ber}(p_Y)$, and $X$ and $Y$ are independent, what is
the joint PMF?
7.23 (^) (w) If the joint PMF is given as


Px,Y[hj] 


i = 0, 1, ..., 10; j = 0, 1, ..., 11


are  X  and  Y  independent?  What  are  the  marginal  PMFs? 

7.24 (t) Assume that X and Y are discrete random variables that take on all integer
values and are independent. Prove that the PMF of Z = X - Y is given by

\[
p_Z[l] = \sum_{k=-\infty}^{\infty} p_X[k] p_Y[k-l] \qquad l = \ldots,-1,0,1,\ldots
\]

by following the same procedure as was used to derive (7.22). Note that the
transformation from (X, Y) to (W, Z) is one-to-one. Next show that if X and
Y take on nonnegative integer values only, then

\[
p_Z[l] = \sum_{k=\max(0,l)}^{\infty} p_X[k] p_Y[k-l] \qquad l = \ldots,-1,0,1,\ldots .
\]

7.25 (f) Using the result of Problem 7.24 find the PMF for $Z = X - Y$ if $X \sim
\mathrm{Pois}(\lambda_X)$, $Y \sim \mathrm{Pois}(\lambda_Y)$, and X and Y are independent. Hint: The result will
be in the form of infinite sums.

7.26 (w) Find the PMF for Z = max(X, Y) if the joint PMF is given in Table 7.5.

7.27 (^) (f) If X ~ Ber(1/2), Y ~ Ber(1/2), and X and Y are independent, find
the PMF for Z = X + Y. Why does the width of the PMF increase? Does
the variance increase?

7.28 (t) Prove that $E_{X,Y}[g(X)] = E_X[g(X)]$. Do X and Y have to be independent?

7.29 (t) Prove that

\[
E_{X,Y}[a g(X) + b h(Y)] = a E_X[g(X)] + b E_Y[h(Y)].
\]

7.30 (t) Prove (7.31).

7.31 (t) Find a formula for var(X - Y) similar to (7.33). What can you say about
the relationship between var(X + Y) and var(X - Y) if X and Y are
uncorrelated?

7.32  (f)  Find  the  covariance  for  the  joint  PMF  given  in  Table  7.4.  How  do  you 
know  the  value  that  you  obtained  is  correct? 

7.33  (^)  (f)  Find  the  covariance  for  the  joint  PMF  given  in  Table  7.5. 
7.34 (t) Prove the following properties of the covariance:

\begin{align*}
\mathrm{cov}(X,Y) &= E_{X,Y}[XY] - E_X[X]E_Y[Y] \\
\mathrm{cov}(X,X) &= \mathrm{var}(X) \\
\mathrm{cov}(Y,X) &= \mathrm{cov}(X,Y) \\
\mathrm{cov}(cX,Y) &= c[\mathrm{cov}(X,Y)] \\
\mathrm{cov}(X,cY) &= c[\mathrm{cov}(X,Y)] \\
\mathrm{cov}(X,X+Y) &= \mathrm{cov}(X,X) + \mathrm{cov}(X,Y) \\
\mathrm{cov}(X+Y,X) &= \mathrm{cov}(X,X) + \mathrm{cov}(Y,X)
\end{align*}

for $c$ a constant.

7.35 (t) If X and Y have a covariance of cov(X, Y), we can transform them to a
new pair of random variables whose covariance is zero. To do so we let

W = X
Z = aX + Y

where a = -cov(X, Y)/var(X). Show that cov(W, Z) = 0. This process is
called decorrelating the random variables. See also Example 9.4 for another
method.

7.36 (f) Apply the results of Problem 7.35 to the joint PMF given in Table 7.5.
Verify by direct calculation that cov(W, Z) = 0.

7.37 (^) (f) If the joint PMF is given as

\[
p_{X,Y}[i,j] = \left(\frac{1}{2}\right)^{i+j} \qquad i = 1,2,\ldots;\ j = 1,2,\ldots
\]

compute the covariance.

7.38 (0) (f) Determine the minimum mean square error for the joint PMF shown
in Figure 7.9a. You will need to evaluate $E_{X,Y}[(Y - ((14/11)X - 1/11))^2]$.

7.39 (t,f) Prove that the minimum mean square error of the optimal linear
predictor is given by

\[
\mathrm{mse}_{\min} = E_{X,Y}[(Y - (a_{\mathrm{opt}} X + b_{\mathrm{opt}}))^2] = \mathrm{var}(Y)\left(1 - \rho_{X,Y}^2\right).
\]

Use this formula to check your result for Problem 7.38.

7.40 (0) (w) In this problem we compare the prediction of a random variable with
and without the knowledge of a second random variable outcome. Consider
the joint PMF shown below.

          j = 0    j = 1
i = 0      1/8      1/4
i = 1      1/4      3/8

Table 7.10: Joint PMF values for Problem 7.40.

First determine the optimal linear prediction of Y
without any knowledge of the outcome of X (see Section 6.6). Also, compute
the minimum mean square error. Next determine the optimal linear prediction
of Y based on the knowledge that X = x and compute the minimum mean
square error. Plot the predictions versus x in the plane. How do the minimum
mean square errors compare?

7.41 (o) (w,c) For the joint PMF of height and weight shown in Figure 7.1
determine the best linear prediction of weight based on a knowledge of height. You
will need to use Table 4.1 as well as a computer to carry out this problem.
Does your answer seem reasonable? Is your prediction of a person's weight if
the height is 70 inches reasonable? How about if the height is 78 inches? Can
you explain the difference?

7.42 (f) Prove that the transformed random variable

\[
X_s = \frac{X - E_X[X]}{\sqrt{\mathrm{var}(X)}}
\]

has an expected value of 0 and a variance of 1.

7.43 (o) (w) The linear prediction of one random variable based on the outcome
of another becomes more difficult if noise is present. We model noise as the
addition of an uncorrelated random variable. Specifically, assume that we wish
to predict X based on observing X + N, where N represents the noise. If X
and N are both zero mean random variables that are uncorrelated with each
other, determine the correlation coefficient between W = X and Z = X + N.
How does it depend on the power in X, which is defined as $E_X[X^2]$, and the
power in N, also defined as $E_N[N^2]$?

7.44 (w) Consider var(X + Y), where X and Y are correlated random variables.
How is the variance of a sum of random variables affected by the correlation
between the random variables? Hint: Express the variance of the sum of the
random variables using the correlation coefficient.

7.45 (f) Prove that if Y = aX + b, where a and b are constants, then $\rho_{X,Y} = 1$ if
a > 0 and $\rho_{X,Y} = -1$ if a < 0.

7.46 (0) (w) If X ~ Ber(1/2), Y ~ Ber(1/2), and X and Y are independent, find
the PMF for Z = X + Y. Use the characteristic function approach to do so.
Compare your results to that of Problem 7.27.

7.47 (w) Using characteristic functions prove that the binomial PMF has the
reproducing property. That is to say, if $X \sim \mathrm{bin}(M_X, p)$, $Y \sim \mathrm{bin}(M_Y, p)$, and
X and Y are independent, then $Z = X + Y \sim \mathrm{bin}(M_X + M_Y, p)$. Why does
this make sense in light of the fact that a sequence of independent Bernoulli
trials can be used to derive the binomial PMF?

7.48 (o) (c) Using the joint PMF shown in Table 7.7 generate realizations of the
random vector (X, Y) and estimate its joint and marginal PMFs. Compare
your estimated results to the true values.

7.49 (o) (c) For the joint PMF shown in Table 7.7 determine the correlation
coefficient. Next use a computer simulation to generate realizations of the random
vector (X, Y) and estimate the correlation coefficient as


where 


Px,Y  = 


) 


m— 1 


and $(x_m, y_m)$ is the $m$th realization.

7.50 (w,c) If X ~ geom(p), Y ~ geom(p), and X and Y are independent, show
that the PMF of Z = X + Y is given by

\[
p_Z[k] = p^2 (k - 1)(1 - p)^{k-2} \qquad k = 2,3,\ldots
\]

To avoid errors use the discrete unit step sequence. Next, for p = 1/2
generate realizations of Z by first generating realizations of X, then generating
realizations of Y and adding each pair of realizations together. Estimate the
PMF of Z and compare it to the true PMF.

7.51 (w,c) Using the joint PMF given in Table 7.5 determine the covariance to
show that it is nonzero and hence X and Y are correlated. Next use the
procedure of Problem 7.35 to determine transformed random variables W and
Z that are uncorrelated. Verify that W and Z are uncorrelated by estimating
the covariance as

\[
\widehat{\mathrm{cov}}(W,Z) = \frac{1}{M} \sum_{m=1}^{M} w_m z_m - \bar{w}\bar{z}
\]

where

\[
\bar{w} = \frac{1}{M} \sum_{m=1}^{M} w_m \qquad \bar{z} = \frac{1}{M} \sum_{m=1}^{M} z_m
\]

and $(w_m, z_m)$ is the $m$th realization. Be sure to generate the realizations of W
and Z as $w_m = x_m$ and $z_m = a x_m + y_m$, where $(x_m, y_m)$ is the $m$th realization
of (X, Y).


Appendix 7A


Derivation of the
Cauchy-Schwarz Inequality


The Cauchy-Schwarz inequality was given by

\[
|E_{V,W}[VW]| \le \sqrt{E_V[V^2]} \sqrt{E_W[W^2]} \tag{7A.1}
\]

with equality holding if and only if $W = cV$, for $c$ a constant. To prove this, we
first note that for all $\alpha \ne 0$ and $\beta \ne 0$

\[
E_{V,W}[(\alpha V - \beta W)^2] \ge 0. \tag{7A.2}
\]

If we let

\begin{align*}
\alpha &= \sqrt{E_W[W^2]} \\
\beta &= \sqrt{E_V[V^2]}
\end{align*}

then we have that

\begin{align*}
E_{V,W}\left[\left(\sqrt{E_W[W^2]}\, V - \sqrt{E_V[V^2]}\, W\right)^2\right] &\ge 0 \\
E_{V,W}\left[E_W[W^2] V^2 - 2\sqrt{E_W[W^2]}\sqrt{E_V[V^2]}\, VW + E_V[V^2] W^2\right] &\ge 0 \\
E_W[W^2] E_V[V^2] - 2\sqrt{E_W[W^2]}\sqrt{E_V[V^2]}\, E_{V,W}[VW] + E_V[V^2] E_W[W^2] &\ge 0
\end{align*}

since $E_{V,W}[g(W)] = E_W[g(W)]$, etc., which results in

\[
E_W[W^2] E_V[V^2] - \sqrt{E_W[W^2]}\sqrt{E_V[V^2]}\, E_{V,W}[VW] \ge 0.
\]

Dividing by $E_W[W^2] E_V[V^2]$ produces

\[
1 - \frac{E_{V,W}[VW]}{\sqrt{E_V[V^2]}\sqrt{E_W[W^2]}} \ge 0
\]

or finally, upon rearranging terms we have that

\[
\frac{E_{V,W}[VW]}{\sqrt{E_V[V^2]}\sqrt{E_W[W^2]}} \le 1
\]

or

\[
E_{V,W}[VW] \le \sqrt{E_V[V^2]}\sqrt{E_W[W^2]}.
\]

By replacing the negative sign in (7A.2) by a positive sign and proceeding in an
identical manner, we will obtain

\[
-E_{V,W}[VW] \le \sqrt{E_V[V^2]}\sqrt{E_W[W^2]}
\]

and hence combining the two results yields the desired inequality. To determine
when the equal sign will hold, we note that

\[
E_{V,W}[(\alpha V - \beta W)^2] = \sum_i \sum_j (\alpha v_i - \beta w_j)^2 p_{V,W}[v_i, w_j]
\]

which can only equal zero when $(\alpha v_i - \beta w_j)^2 = 0$ for all $i$ and $j$ since $p_{V,W}[v_i, w_j] > 0$.
Thus, for equality to hold we must have

\[
\alpha v_i = \beta w_j \qquad \text{all } i \text{ and } j
\]

which is equivalent to requiring

\[
\alpha V = \beta W
\]

or finally dividing by $\beta$ (assumed not equal to zero), we obtain the condition for
equality as

\[
W = \frac{\alpha}{\beta} V = cV
\]

for $c$ a constant.
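A numerical check of the inequality and of its equality condition is easily carried out. In the Python sketch below the joint PMF of Table 7.7 is reused as the PMF of (V, W) for the strict-inequality case, and a hypothetical PMF concentrated on the line W = cV with c = -2 is used for the equality case. Both choices are ours.

```python
from math import sqrt

# Joint PMF of Table 7.7 reused as the PMF of (V, W); any valid joint PMF would do.
pmf = {(0, 0): 1/8, (0, 1): 1/8, (1, 0): 1/4, (1, 1): 1/2}

def cauchy_schwarz_sides(pmf):
    """Return (|E[VW]|, sqrt(E[V^2]) * sqrt(E[W^2])) for a joint PMF of (V, W)."""
    e_vw = sum(p * v * w for (v, w), p in pmf.items())
    e_v2 = sum(p * v * v for (v, w), p in pmf.items())
    e_w2 = sum(p * w * w for (v, w), p in pmf.items())
    return abs(e_vw), sqrt(e_v2) * sqrt(e_w2)

# General case: the left side is strictly smaller than the right side.
lhs, rhs = cauchy_schwarz_sides(pmf)

# Equality case: all probability on the line W = cV (c = -2, a hypothetical choice).
lhs_eq, rhs_eq = cauchy_schwarz_sides({(v, -2 * v): 1/4 for v in (1, 2, 3, 4)})
```

In the second case the two sides coincide even though c is negative, which is why the absolute value appears in (7A.1).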
Conditional Probability Mass
Functions

8.1  Introduction 


In  Chapter  4  we  discussed  the  concept  of  conditional  probability.  We  recall  that  a 
conditional  probability  P[A\B]  is  the  probability  of  an  event  A ,  given  that  we  know 
that  some  other  event  B  has  occurred.  Except  for  the  case  when  the  two  events  are 
independent  of  each  other,  the  knowledge  that  B  has  occurred  will  change  the  prob¬ 
ability  P[A}.  In  other  words,  P[A\B]  is  our  new  probability  in  light  of  the  additional 
knowledge.  In  many  practical  situations,  two  random  mechanisms  are  at  work  and 
are  described  by  events  A  and  B.  An  example  of  such  a  compound  experiment  was 
given  in  Example  4.2.  To  compute  probabilities  for  a  compound  experiment  it  is 
usually  convenient  to  use  a  conditioning  argument  to  simplify  the  reasoning.  For 
example,  say  we  choose  one  of  two  coins  and  toss  it  4  times.  We  might  inquire 
as  to  the  probability  of  observing  2  or  more  heads.  However,  this  probability  will 
depend  upon  which  coin  was  chosen,  as  for  example  in  the  situation  where  one  coin 
is  fair  and  the  other  coin  is  weighted.  It  is  therefore  convenient  to  define  conditional 
probability  mass  functions ,  px[k |coin  1  chosen]  and  px[k |coin  2  chosen],  since  once 
we  know  which  coin  is  chosen,  we  can  easily  specify  the  PMF.  In  particular,  for 
this  example  the  conditional  PMF  is  a  binomial  one  whose  value  of  p  depends  upon 
which  coin  is  chosen  and  with  k  denoting  the  number  of  heads  (see  (5.6)).  Once 
the  conditional  PMFs  are  known,  we  have  by  the  law  of  total  probability  (see  (4.4)) 
that  the  probability  of  observing  k  heads  for  this  experiment  is  given  by  the  PMF 


Px  [ k ]  = 


Px[k  |  coin  1  chosen]  P  [coin  1  chosen] 
+Px[k | coin  2  chosen] P [coin  2  chosen]. 
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Therefore,  the  desired  probability  of  observing  2.  or  more  heads  is 

4 

P[X>2]  =  Y,Px[k] 

k= 2 
4 

=  [A;  |  coin  1  chosen]  P  [coin  1  chosen] 

k= 2 

+  Px[k | coin  2  chosen] P [coin  2  chosen]). 

The  PMF  that  is  required  depends  directly  on  the  conditional  PMFs  (of  which  there 
are  two).  The  use  of  conditional  PMFs  greatly  simplifies  our  task  in  that  given 
the  event,  i.e.,  the  coin  chosen,  the  PMF  of  the  number  of  heads  observed  readily 
follows.  Also,  in  many  problems,  including  this  one,  it  is  actually  the  conditional 
PMFs  that  are  specified  in  the  description  of  the  experimental  procedure.  It  makes 
sense,  therefore,  to  define  a  conditional  PMF  and  study  its  properties.  For  the  most 
part,  the  definitions  and  properties  will  mirror  those  of  the  conditional  probability 
$P[A|B]$, where $A$ and $B$ are events defined on $S_{X,Y}$.

8.2  Summary 

The  utility  of  defining  a  conditional  PMF  is  illustrated  in  Section  8.3.  It  is  especially 
appropriate  when  the  experiment  is  a  compound  one,  in  which  the  second  part  of 
the  experiment  depends  upon  the  outcome  of  the  first  part.  The  definition  of  the 
conditional  PMF  is  given  in  (8.7).  It  has  the  usual  properties  of  a  PMF,  that  of 
being  between  0  and  1  and  also  summing  to  one.  Its  properties  and  relationships 
are summarized by Properties 8.1-8.5. The conditional PMF is related to the joint
PMF  and  the  marginal  PMFs  by  these  properties.  They  are  also  depicted  in  Figure 
8.4  for  easy  reference.  If  the  random  variables  are  independent,  then  the  conditional 
PMF  reduces  to  the  usual  marginal  PMF  as  shown  in  (8.22).  For  general  probability 
calculations  based  on  the  conditional  PMF  one  can  use  (8.23).  In  Section  8.5  it  is 
shown  how  to  use  conditioning  arguments  to  simplify  the  derivation  of  the  PMF  for 
$Z = g(X,Y)$. The PMF can be found using (8.24), which makes use of the conditional
PMF. In particular, if $X$ and $Y$ are independent, the procedure is especially
simplified  with  examples  given  in  Section  8.5.  The  mean  of  the  conditional  PMF  is 
defined  by  (8.30).  It  is  computed  by  the  usual  procedures  but  uses  the  conditional 
PMF  as  the  “averaging”  PMF.  It  is  next  shown  that  the  mean  of  the  unconditional 
PMF  can  be  found  by  averaging  over  the  means  of  the  conditional  PMFs  as  given 
by  (8.35).  This  simplifies  the  computation.  Generation  of  realizations  of  random 
vectors  (X,  Y)  can  be  simplified  using  conditioning  arguments.  An  illustration  and 
MATLAB code segment is given in Section 8.7. Finally, an application of conditioning
to the modeling of human learning is described in Section 8.8. Utilizing the
posterior PMF, which is a conditional PMF, one can demonstrate that “learning”
takes place as the result of observing the outcomes of repeated experiments. The
degree of learning is embodied in the posterior PMF.


8.3  Conditional  Probability  Mass  Function 

We  continue  with  the  introductory  example  to  illustrate  the  utility  of  the  conditional 
probability  mass  function.  Summarizing  the  introductory  problem,  we  have  an 
experimental  procedure  in  which  we  first  choose  a  coin,  either  coin  1  or  coin  2.  Coin 
1 has a probability of heads of $p_1$, while coin 2 has a probability of heads of $p_2$. Let
$X$ be the discrete random variable describing the outcome of the coin choice so that

\[
X = \begin{cases} 1 & \text{if coin 1 is chosen} \\ 2 & \text{if coin 2 is chosen.} \end{cases}
\]


Since $S_X = \{1, 2\}$, we assign a PMF to $X$ of

\[
p_X[i] = \begin{cases} a & i = 1 \\ 1 - a & i = 2 \end{cases} \tag{8.1}
\]


where $0 < a < 1$. The second part of the experiment consists of tossing the chosen
coin 4 times in succession. Call the number of heads observed $Y$ and note that
$S_Y = \{0, 1, 2, 3, 4\}$. Hence, the overall set of outcomes of the compound
experiment is $S_{X,Y} = S_X \times S_Y$, which is shown in Figure 8.1.

Figure 8.1: Mapping for coin toss example. $x$ denotes the coin chosen while $y$ denotes
the number of heads observed.

The overall outcome is described by the random vector $(X, Y)$, where $X$ is the coin
chosen and $Y$ is the number of heads observed for the 4 coin tosses. If we wish to
determine the probability of 2 or more heads, then this is the probability of the set
$A$ shown in Figure 8.1. It is given mathematically as

\[
\begin{aligned}
P[A] &= \sum_{\{(i,j) : (i,j) \in A\}} p_{X,Y}[i,j] \\
     &= \sum_{i=1}^{2} \sum_{j=2}^{4} p_{X,Y}[i,j].
\end{aligned} \tag{8.2}
\]

Hence,  we  need  only  specify  the  joint  PMF  to  determine  the  desired  probability.  To 
do  so  we  make  use  of  our  definition  of  the  joint  PMF  as  well  as  our  earlier  concepts 
from  conditional  probability  (see  Chapter  4).  Recall  from  Chapter  7  the  definitions 
of  the  joint  PMF  and  marginal  PMF  as 

\[
\begin{aligned}
p_{X,Y}[i,j] &= P[X = i, Y = j] \\
p_X[i] &= P[X = i].
\end{aligned}
\]

By  using  the  definition  of  conditional  probability  for  events  we  have 

\[
\begin{aligned}
p_{X,Y}[i,j] &= P[X = i, Y = j] && \text{(definition of joint PMF)} \\
             &= P[Y = j \mid X = i]\,P[X = i] && \text{(definition of conditional prob.)} \\
             &= P[Y = j \mid X = i]\,p_X[i] && \text{(definition of marginal PMF).}
\end{aligned} \tag{8.3}
\]

From (8.1) we have $p_X[i]$ and from the experimental description we can determine
$P[Y = j \mid X = i]$. When $X = 1$, we toss a coin with a probability of heads $p_1$, and
when $X = 2$, we toss a coin with a probability of heads $p_2$. Also, we have previously
shown that for a coin with a probability of heads $p_i$ that is tossed 4 times, the
number of heads observed has a binomial PMF. Thus, for $i = 1, 2$

\[
P[Y = j \mid X = i] = \binom{4}{j} p_i^j (1 - p_i)^{4-j} \qquad j = 0, 1, 2, 3, 4. \tag{8.4}
\]

Note that the probability depends on the outcome $X = i$ via $p_i$. Also, for a given
value of $X = i$, the probability has all the usual properties of a PMF. These
properties are

\[
\begin{aligned}
0 \leq P[Y = j \mid X = i] &\leq 1 \\
\sum_{j=0}^{4} P[Y = j \mid X = i] &= 1.
\end{aligned}
\]

It is therefore appropriate to define $P[Y = j \mid X = i]$ as a conditional PMF. We will
denote it by

\[
p_{Y|X}[j|i] = P[Y = j \mid X = i] \qquad j = 0, 1, 2, 3, 4.
\]

Examples are plotted in Figure 8.2 for $p_1 = 1/4$ and $p_2 = 1/2$.

Figure 8.2: Conditional PMFs given by (8.4). (a) $i = 1$, $p_1 = 1/4$. (b) $i = 2$, $p_2 = 1/2$.

Returning to our problem we can now determine the joint PMF. Using (8.3) we have


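\[
p_{X,Y}[i,j] = p_{Y|X}[j|i]\,p_X[i]
\]

and using (8.4) and (8.1) the joint PMF is

\[
p_{X,Y}[i,j] = \begin{cases}
\binom{4}{j} p_1^j (1 - p_1)^{4-j}\, a & i = 1;\ j = 0, 1, 2, 3, 4 \\
\binom{4}{j} p_2^j (1 - p_2)^{4-j} (1 - a) & i = 2;\ j = 0, 1, 2, 3, 4.
\end{cases}
\]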


Finally  the  desired  probability  is  from  (8.2) 

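\[
\begin{aligned}
P[A] &= \sum_{j=2}^{4} p_{X,Y}[1,j] + \sum_{j=2}^{4} p_{X,Y}[2,j] \\
     &= \sum_{j=2}^{4} \binom{4}{j} p_1^j (1 - p_1)^{4-j}\, a
      + \sum_{j=2}^{4} \binom{4}{j} p_2^j (1 - p_2)^{4-j} (1 - a).
\end{aligned} \tag{8.5}
\]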


As an example, if $p_1 = 1/4$ and $p_2 = 3/4$, we have for $a = 1/2$, that $P[A] = 0.6055$,
but if $a = 1/8$, then $P[A] = 0.8633$. Can you explain this?
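
The two numerical values quoted above can be checked with a few lines of MATLAB. The following is only a sketch of (8.5); the variable names (condProb, PA) are our own choices and are not part of the text.

```matlab
% Probability of 2 or more heads in 4 tosses of a randomly chosen coin,
% computed by conditioning on the coin as in (8.5).
p = [1/4 3/4];            % probability of heads for coin 1 and coin 2
N = 4;                    % number of tosses
j = 2:N;                  % numbers of heads that make up the event A

% condProb(i) = P[Y >= 2 | X = i], summed from the binomial PMF (8.4)
condProb = zeros(1,2);
for i = 1:2
    for jj = j
        condProb(i) = condProb(i) + nchoosek(N,jj)*p(i)^jj*(1-p(i))^(N-jj);
    end
end

a = [1/2 1/8];            % two choices of a = P[coin 1 chosen]
PA = condProb(1)*a + condProb(2)*(1-a);   % law of total probability
fprintf('a = %5.3f   P[A] = %6.4f\n', [a; PA]);
```

The conditional probabilities are 67/256 for coin 1 and 243/256 for coin 2, so the value of $a$ only sets how these two numbers are weighted.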

Note  from  (8.5)  that  the  conditional  PMF  is  also  expressed  as 


\[
p_{Y|X}[j|i] = \frac{p_{X,Y}[i,j]}{p_X[i]} \tag{8.6}
\]

and is only a renaming for the conditional probability of the event that
$A_j = \{s : Y(s) = j\}$ given that $B_i = \{s : X(s) = i\}$. To make this connection we have

\[
\begin{aligned}
p_{Y|X}[j|i] = P[Y = j \mid X = i] &= \frac{P[X = i, Y = j]}{P[X = i]} \\
&= \frac{P[A_j \cap B_i]}{P[B_i]} \\
&= P[A_j \mid B_i]
\end{aligned}
\]

and hence $p_{Y|X}[j|i]$ is a conditional probability for the events $A_j$ and $B_i$.


8.4  Joint,  Conditional,  and  Marginal  PMFs 


As  evidenced  by  (8.6),  there  are  relationships  between  the  joint,  conditional,  and 
marginal  PMFs.  In  this  section  we  describe  these  relationships.  To  do  so  we  rewrite 
the  definition  of  the  conditional  PMF  in  slightly  more  generality  as 


\[
p_{Y|X}[y_j|x_i] = \frac{p_{X,Y}[x_i, y_j]}{p_X[x_i]} \tag{8.7}
\]

for a sample space $S_{X,Y}$ which may not consist solely of integer two-tuples. It is
always assumed that $p_X[x_i] \neq 0$. Otherwise, the definition does not make any sense.
The conditional PMF, although appearing to be a function of two variables, $x_i$ and
$y_j$, should be viewed as a family or set of PMFs. Each PMF in the family is a valid
PMF when $x_i$ is considered to be a constant. In the example of the previous section,
we had $p_{Y|X}[j|1]$ and $p_{Y|X}[j|2]$. The family is therefore $\{p_{Y|X}[j|1], p_{Y|X}[j|2]\}$ and
each member is a valid PMF, whose values depend on $j$. Hence, we would expect
that (see Problem 8.4)


\[
\begin{aligned}
\sum_{j=-\infty}^{\infty} p_{Y|X}[j|1] &= 1 \\
\sum_{j=-\infty}^{\infty} p_{Y|X}[j|2] &= 1
\end{aligned}
\]

but not $\sum_{i=-\infty}^{\infty} p_{Y|X}[j|i] = 1$ (see also Problem 4.9). Before proceeding to list the
relationships between the various PMFs, we give an example of the calculation of
the conditional PMF based on (8.7).

Example  8.1  —  Two  dice  toss 

Two  dice  are  tossed  with  all  outcomes  assumed  to  be  equally  likely.  The  number  of 
dots  observed  on  each  die  are  added  together.  What  is  the  conditional  PMF  of  the 
sum  if  it  is  known  that  the  sum  is  even?  We  begin  by  letting  Y  be  the  sum  and  define 
X  =  1  if  the  sum  is  even  and  X  =  0  if  the  sum  is  odd.  Thus,  we  wish  to  determine 
$p_{Y|X}[j|1]$ and $p_{Y|X}[j|0]$ for all $j$. The sample space for $Y$ is $S_Y = \{2, 3, \ldots, 12\}$
as can be seen from Table 8.1, which lists the sum of the two dice outcomes as a
function of the outcomes for each die. The entries marked with an asterisk are the
ones for which the sum is even and therefore comprise the sample space for
$p_{Y|X}[j|1]$. Note that each outcome $(i,j)$ has an assumed probability of occurring
of 1/36.

         j = 1   j = 2   j = 3   j = 4   j = 5   j = 6
i = 1     2*      3       4*      5       6*      7
i = 2     3       4*      5       6*      7       8*
i = 3     4*      5       6*      7       8*      9
i = 4     5       6*      7       8*      9       10*
i = 5     6*      7       8*      9       10*     11
i = 6     7       8*      9       10*     11      12*

Table 8.1: The sum of the number of dots observed for two dice - an asterisk indicates
an even sum.

Now, using (8.7)


\[
p_{Y|X}[j|1] = \frac{p_{X,Y}[1,j]}{p_X[1]} \qquad j = 2, 4, 6, 8, 10, 12 \tag{8.8}
\]

where $p_{X,Y}[1,j]$ is the probability of the sum being even and also equaling $j$. Since we
assume in (8.8) that $j$ is even (otherwise $p_{Y|X}[j|1] = 0$), we have that $p_{X,Y}[1,j] = p_Y[j]$
for $j = 2, 4, 6, 8, 10, 12$. Also, there are 18 even outcomes, which results in
$p_X[1] = 1/2$. Thus, (8.8) becomes

\[
p_{Y|X}[j|1] = \frac{p_Y[j]}{1/2} = \frac{N_j (1/36)}{1/2}
\]


where  Nj  is  the  number  of  outcomes  in  Sx,y  for  which  the  sum  is  j.  From  Table 
8.1  we  can  easily  find  Nj  so  that 


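\[
p_{Y|X}[j|1] = \begin{cases}
\frac{1}{18} & j = 2 \\
\frac{3}{18} & j = 4 \\
\frac{5}{18} & j = 6 \\
\frac{5}{18} & j = 8 \\
\frac{3}{18} & j = 10 \\
\frac{1}{18} & j = 12.
\end{cases} \tag{8.9}
\]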
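Note that as expected $\sum_j p_{Y|X}[j|1] = 1$. The reader is asked to verify by a similar
calculation that (see Problem 8.7)

\[
p_{Y|X}[j|0] = \begin{cases}
\frac{1}{9} & j = 3 \\
\frac{2}{9} & j = 5 \\
\frac{1}{3} & j = 7 \\
\frac{2}{9} & j = 9 \\
\frac{1}{9} & j = 11.
\end{cases} \tag{8.10}
\]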


These conditional PMFs are shown in Figure 8.3. Also, note that
$p_{Y|X}[j|0] \neq 1 - p_{Y|X}[j|1]$. Each conditional PMF is generally different.

Figure 8.3: Conditional PMFs for Example 8.1. Each panel is annotated with the mean of
the conditional PMF.

◊
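
The counting argument of Example 8.1 can be reproduced numerically. The sketch below enumerates the 36 outcomes and applies (8.7) directly; the variable names (pYgivenEven, pYgivenOdd) are ours and only serve the illustration.

```matlab
% Conditional PMF of the sum of two dice given that the sum is even (X = 1)
% or odd (X = 0), found by enumerating all 36 equally likely outcomes.
[d1, d2] = meshgrid(1:6, 1:6);     % outcomes of die 1 and die 2
s = d1(:) + d2(:);                 % sum for each of the 36 outcomes
x = (mod(s,2) == 0);               % true if the sum is even

yvals = 2:12;                      % sample space of the sum Y
pYgivenEven = zeros(size(yvals));
pYgivenOdd  = zeros(size(yvals));
for k = 1:length(yvals)
    % joint probability divided by marginal probability, as in (8.7)
    pYgivenEven(k) = (sum(s == yvals(k) & x)/36)  / (sum(x)/36);
    pYgivenOdd(k)  = (sum(s == yvals(k) & ~x)/36) / (sum(~x)/36);
end
disp([yvals' pYgivenEven' pYgivenOdd']);
```

Each of the two displayed columns of probabilities sums to one, as every member of the family of conditional PMFs must.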

There  are  several  relationships  between  the  joint,  marginal,  and  conditional  PMFs. 
We  now  summarize  these  as  properties. 

Property  8.1  -  Joint  PMF  yields  conditional  PMFs. 

If the joint PMF $p_{X,Y}[x_i, y_j]$ is known, then the conditional PMFs are found as

\[
p_{Y|X}[y_j|x_i] = \frac{p_{X,Y}[x_i, y_j]}{\sum_j p_{X,Y}[x_i, y_j]} \tag{8.11}
\]

\[
p_{X|Y}[x_i|y_j] = \frac{p_{X,Y}[x_i, y_j]}{\sum_i p_{X,Y}[x_i, y_j]}. \tag{8.12}
\]
Proof: Since the marginal PMF $p_X[x_i]$ is found as $\sum_j p_{X,Y}[x_i, y_j]$, the denominator
of (8.7) can be replaced by this to yield (8.11). The equation (8.12) is similarly
proven.

□ 

Hence, we see that the conditional PMF is just the joint PMF with $x_i$ fixed and then
normalized by $\sum_j p_{X,Y}[x_i, y_j]$ so that it sums to one. In Figure 8.3a, the conditional
PMF $p_{Y|X}[j|1]$ evaluated at $j = 8$ is just $p_{X,Y}[1, 8] = 5/36$ divided by the sum of
the probabilities $p_{X,Y}[1, \cdot] = 18/36$, where $\cdot$ indicates all possible values of $j$. This
yields $p_{Y|X}[8|1] = 5/18$.

Property  8.2  —  Conditional  PMFs  are  related. 


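\[
p_{X|Y}[x_i|y_j] = \frac{p_{Y|X}[y_j|x_i]\,p_X[x_i]}{p_Y[y_j]} \tag{8.13}
\]

Proof: By interchanging $X$ and $Y$ in (8.7) we have

\[
p_{X|Y}[x_i|y_j] = \frac{p_{Y,X}[y_j, x_i]}{p_Y[y_j]}
\]

but

\[
\begin{aligned}
p_{Y,X}[y_j, x_i] &= P[Y = y_j, X = x_i] \\
                  &= P[X = x_i, Y = y_j] \qquad \text{(since } A \cap B = B \cap A\text{)} \\
                  &= p_{X,Y}[x_i, y_j]
\end{aligned}
\]

and therefore

\[
p_{X|Y}[x_i|y_j] = \frac{p_{X,Y}[x_i, y_j]}{p_Y[y_j]}. \tag{8.14}
\]

Using $p_{X,Y}[x_i, y_j] = p_{Y|X}[y_j|x_i]\,p_X[x_i]$ from (8.7) in (8.14) yields the desired result
(8.13).

□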


Property  8.3  —  Conditional  PMF  is  expressible  using  Bayes’  rule. 


\[
p_{Y|X}[y_j|x_i] = \frac{p_{X|Y}[x_i|y_j]\,p_Y[y_j]}{\sum_j p_{X|Y}[x_i|y_j]\,p_Y[y_j]} \tag{8.15}
\]

Proof: From (8.11) we have that

\[
p_{Y|X}[y_j|x_i] = \frac{p_{X,Y}[x_i, y_j]}{\sum_j p_{X,Y}[x_i, y_j]} \tag{8.16}
\]

and using (8.14) we have

\[
p_{X,Y}[x_i, y_j] = p_{X|Y}[x_i|y_j]\,p_Y[y_j] \tag{8.17}
\]

which when substituted into (8.16) yields the desired result.

□

Property  8.4  -  Conditional  PMF  and  its  corresponding  marginal  PMF 
yields  the  joint  PMF. 


\[
p_{X,Y}[x_i, y_j] = p_{Y|X}[y_j|x_i]\,p_X[x_i] \tag{8.18}
\]

\[
p_{X,Y}[x_i, y_j] = p_{X|Y}[x_i|y_j]\,p_Y[y_j] \tag{8.19}
\]

Proof: (8.18) follows from the definition of the conditional PMF (8.7) and (8.19) is just
(8.17).

□ 


Property  8.5  —  Conditional  PMF  and  its  corresponding  marginal  PMF 
yields  the  other  marginal  PMF. 


\[
p_Y[y_j] = \sum_i p_{Y|X}[y_j|x_i]\,p_X[x_i] \tag{8.20}
\]

Proof: This is just the law of total probability in disguise or equivalently just
$p_Y[y_j] = \sum_i p_{X,Y}[x_i, y_j]$ (marginal PMF from joint PMF).

□ 

These  relationships  are  summarized  in  Figure  8.4.  Notice  that  the  joint  PMF  can 
be  used  to  find  all  the  marginals  and  conditional  PMFs  (see  Figure  8.4a).  The 
conditional  PMF  and  its  corresponding  marginal  PMF  can  be  used  to  find  the 
joint  PMF  (see  Figure  8.4b).  Finally,  the  conditional  PMF  and  its  corresponding 
marginal  PMF  can  be  used  to  find  the  other  conditional  PMF  (see  Figure  8.4c).  As 
emphasized  earlier,  we  cannot  determine  the  joint  PMF  from  the  marginals.  This 
is  only  possible  if  X  and  Y  are  independent  random  variables  since  in  this  case 


\[
p_{X,Y}[x_i, y_j] = p_X[x_i]\,p_Y[y_j]. \tag{8.21}
\]


In  addition,  for  independent  random  variables,  the  use  of  (8.21)  in  (8.7)  yields 


\[
p_{Y|X}[y_j|x_i] = \frac{p_X[x_i]\,p_Y[y_j]}{p_X[x_i]} = p_Y[y_j] \tag{8.22}
\]


or  the  conditional  PMF  is  the  same  as  the  unconditional  PMF.  There  is  no  change 
in  the  probabilities  of  Y  whether  or  not  X  is  observed.  This  is  of  course  consistent 
with  our  previous  definition  of  statistical  independence. 

Finally, for more general conditional probability calculations we sum the appropriate
values of the conditional PMF to yield (see Problem 8.14)

\[
P[Y \in A \mid X = x_i] = \sum_{\{j : y_j \in A\}} p_{Y|X}[y_j|x_i]. \tag{8.23}
\]
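
As a short worked instance of (8.23), take the two dice of Example 8.1 and the event $A = \{8, 9, \ldots, 12\}$, a choice made here only for illustration. Given that the sum is even ($X = 1$), only $j = 8, 10, 12$ contribute, and the values in (8.9) give

```latex
\[
P[Y \geq 8 \mid X = 1]
  = p_{Y|X}[8|1] + p_{Y|X}[10|1] + p_{Y|X}[12|1]
  = \frac{5}{18} + \frac{3}{18} + \frac{1}{18}
  = \frac{1}{2}.
\]
```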
Figure 8.4: Conditional PMF relationships. (Can also interchange $X$ and $Y$ for
similar results.)
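
Properties 8.1-8.5 translate directly into array operations when the joint PMF is stored as a matrix. The sketch below assumes a joint PMF built from the coin example with $a = 1/2$, $p_1 = 1/4$, $p_2 = 1/2$ (illustrative values), with rows indexing $x_i$ and columns indexing $y_j$; all variable names are our own.

```matlab
% Joint PMF for the coin example: rows are coin 1, 2; columns are 0..4 heads.
a = 1/2; p = [1/4 1/2]; N = 4;
pXY = zeros(2, N+1);
for i = 1:2
    for j = 0:N
        pXY(i,j+1) = nchoosek(N,j)*p(i)^j*(1-p(i))^(N-j);
    end
end
pXY(1,:) = a*pXY(1,:);             % p_{Y|X}[j|1] p_X[1], as in (8.18)
pXY(2,:) = (1-a)*pXY(2,:);         % p_{Y|X}[j|2] p_X[2]

pX = sum(pXY, 2);                  % marginal PMF of X (sum over j), 2 x 1
pY = sum(pXY, 1);                  % marginal PMF of Y (sum over i), 1 x 5

pYgX = pXY ./ repmat(pX, 1, N+1);  % (8.11): each row sums to one
pXgY = pXY ./ repmat(pY, 2, 1);    % (8.12): each column sums to one

% Property 8.3 (Bayes' rule): recover p_{Y|X} from p_{X|Y} and p_Y
num = pXgY .* repmat(pY, 2, 1);
pYgX_bayes = num ./ repmat(sum(num, 2), 1, N+1);

% Property 8.5: recover p_Y from p_{Y|X} and p_X
pY_total = pX' * pYgX;
```

Comparing pYgX_bayes with pYgX, and pY_total with pY, gives a quick numerical check that the properties are mutually consistent.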


8.5  Simplifying Probability Calculations using Conditioning

As alluded to in the introduction, conditional PMFs can be used to simplify probability
calculations. To illustrate the use of this approach we once again consider
the determination of the PMF for $Z = X + Y$, where $X$ and $Y$ are independent
discrete random variables that take on integer values. We have already seen that the
solution is $p_Z = p_X \star p_Y$, where $\star$ denotes discrete convolution (see (7.22)). To solve
this problem using conditional PMFs, we ask ourselves the question: Could I find
the PMF of $Z$ if $X$ were known? If so, then we should be able to use conditioning
arguments to first find the conditional PMF of $Z$ given $X$, and then uncondition
the result to yield the PMF of $Z$. Let us say that $X$ is known and that $X = i$. As
a result, we have that conditionally $Z = i + Y$, where $i$ is just a constant. This is
sometimes denoted by $Z|(X = i)$. But this is a transformation from one discrete
random variable $Y$ to another discrete random variable $Z$. We therefore wish to
determine the PMF of a random variable that has been summed with a constant.
It is not difficult to show that if a discrete random variable $U$ has a PMF $p_U[j]$,
then $U + i$ has the PMF $p_U[j - i]$ or the PMF is just shifted to the right by $i$ units.
Thus, the conditional PMF of $Z$ evaluated at $Z = j$ is $p_{Z|X}[j|i] = p_{Y|X}[j - i|i]$. Now
to find the unconditional PMF of $Z$ we use (8.20) with an appropriate change of
variables to yield

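\[
p_Z[j] = \sum_{i=-\infty}^{\infty} p_{Z|X}[j|i]\,p_X[i]
\]

and since $p_{Z|X}[j|i] = p_{Y|X}[j - i|i]$, we have

\[
p_Z[j] = \sum_{i=-\infty}^{\infty} p_{Y|X}[j - i|i]\,p_X[i].
\]

But $X$ and $Y$ are independent so that $p_{Y|X} = p_Y$ and therefore we have the final
result

\[
p_Z[j] = \sum_{i=-\infty}^{\infty} p_Y[j - i]\,p_X[i]
\]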

which  agrees  with  our  earlier  one.  Another  example  follows. 

Example 8.2 - PMF for Z = max(X, Y)

Let  X  and  Y  be  discrete  random  variables  that  take  on  integer  values.  Also,  assume 
independence  of  the  random  variables  X  and  Y  and  that  the  marginal  PMFs  of  X 
and  Y  are  known.  To  find  the  PMF  of  Z  we  use  (8.20)  or  the  law  of  total  probability 
to  yield 

\[
p_Z[k] = \sum_{i=-\infty}^{\infty} p_{Z|X}[k|i]\,p_X[i]. \tag{8.24}
\]

Now $p_X$ is known so that we only need to determine $p_{Z|X}$ for $X = i$. But given that
$X = i$, we have that $Z = \max(i, Y)$ for which the PMF is easily found. We have
thus reduced the original problem, which is to determine the PMF for the random
variable obtained by transforming from $(X, Y)$ to $Z$, to determining the PMF for
a function of only one random variable. Letting $g(Y) = \max(i, Y)$ we see that the
function appears as shown in Figure 8.5. Hence, using (5.9) for the PMF of a single
transformed discrete random variable we have

\[
p_{Z|X}[k|i] = \sum_{\{j : g(j) = k\}} p_{Y|X}[j|i].
\]
Figure 8.5: Plot of the function $g(y) = \max(i, y)$.


Solving for $j$ in $g(j) = k$ (refer to Figure 8.5) yields no solution for $k < i$, the
multiple solutions $j = \ldots, i - 1, i$ for $k = i$, and the single solution $j = k$ for
$k = i + 1, i + 2, \ldots$. This produces

\[
p_{Z|X}[k|i] = \begin{cases}
0 & k = \ldots, i - 2, i - 1 \\
\sum_{j=-\infty}^{i} p_{Y|X}[j|i] & k = i \\
p_{Y|X}[k|i] & k = i + 1, i + 2, \ldots.
\end{cases} \tag{8.25}
\]


Using  this  in  (8.24)  produces 


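\[
\begin{aligned}
p_Z[k] &= \sum_{i=-\infty}^{k-1} p_{Z|X}[k|i]\,p_X[i] + p_{Z|X}[k|k]\,p_X[k]
        + \sum_{i=k+1}^{\infty} p_{Z|X}[k|i]\,p_X[i] && \text{(break up sum)} \\
&= \sum_{i=-\infty}^{k-1} p_{Y|X}[k|i]\,p_X[i] + \sum_{j=-\infty}^{k} p_{Y|X}[j|k]\,p_X[k] + 0 && \text{(use (8.25))} \\
&= \sum_{i=-\infty}^{k-1} p_Y[k]\,p_X[i] + \sum_{j=-\infty}^{k} p_Y[j]\,p_X[k] && \text{(since } X \text{ and } Y \text{ are independent)} \\
&= p_Y[k] \sum_{i=-\infty}^{k-1} p_X[i] + p_X[k] \sum_{j=-\infty}^{k} p_Y[j].
\end{aligned}
\]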


Note  that  due  to  the  independence  assumption  this  final  result  can  also  be  written 
as 

\[
p_Z[k] = \sum_{i=-\infty}^{k-1} p_{X,Y}[i, k] + \sum_{j=-\infty}^{k} p_{X,Y}[k, j]
\]

so that the PMF of $Z$ is obtained by summing all the points of the joint PMF
shown in Figure 8.6 for $k = 2$, as an example. These points comprise the set
$\{(x, y) : \max(x, y) = 2 \text{ and } x = i, y = j\}$. It is now clear that we could have solved this
problem in a more direct fashion by making this observation. As in most problems,
however, the solution is usually trivial once it is known!
Figure 8.6: Points of joint PMF to be summed to find PMF of $Z = \max(X, Y)$ for
$k = 2$.

◊
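
The final result of Example 8.2 can be checked against a direct summation of the joint PMF over the points with max(x, y) = k. In the sketch below the marginal PMFs are hypothetical values on the integers 0 to 3, chosen only so that the two computations can be compared.

```matlab
% Hypothetical marginal PMFs on the integers 0..3 (illustrative values only)
vals = 0:3;
pX = [0.1 0.2 0.3 0.4];
pY = [0.4 0.3 0.2 0.1];

% Result obtained by conditioning:
% p_Z[k] = p_Y[k]*sum_{i<k} p_X[i] + p_X[k]*sum_{j<=k} p_Y[j]
pZ = zeros(size(vals));
for n = 1:length(vals)
    pZ(n) = pY(n)*sum(pX(1:n-1)) + pX(n)*sum(pY(1:n));
end

% Direct check: sum the joint PMF over all points with max(i,j) = k
pXY = pX' * pY;                    % joint PMF under independence, pXY(r,c) = pX(r)*pY(c)
[J, I] = meshgrid(vals, vals);     % I(r,c) = vals(r), J(r,c) = vals(c)
pZdirect = zeros(size(vals));
for n = 1:length(vals)
    pZdirect(n) = sum(pXY(max(I,J) == vals(n)));
end
disp([vals' pZ' pZdirect']);
```

The two columns of probabilities agree, which is the observation made at the end of the example: the conditioning argument and the direct sum over the L-shaped set of points in Figure 8.6 are the same calculation.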

As  we  have  seen,  a  general  procedure  for  determining  the  PMF  for  Z  =  g(X,Y) 
when  X  and  Y  are  independent  is  as  follows: 

1. Fix $X = x_i$ and let $Z|(X = x_i) = g(x_i, Y)$.

2. Find the PMF for $Z|X$ by using the techniques for a transformation of a single
   random variable $Y$ into another random variable $Z$. The formula is from (5.9),
   where the PMFs are first converted to conditional PMFs

\[
\begin{aligned}
p_{Z|X}[z_k|x_i] &= \sum_{\{j : g(x_i, y_j) = z_k\}} p_{Y|X}[y_j|x_i] && \text{for each } x_i \\
                 &= \sum_{\{j : g(x_i, y_j) = z_k\}} p_Y[y_j] && \text{for each } x_i \text{ (due to independence).}
\end{aligned}
\]

3. Uncondition the conditional PMF to yield the desired PMF

\[
p_Z[z_k] = \sum_i p_{Z|X}[z_k|x_i]\,p_X[x_i].
\]
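
The three steps above can be sketched in MATLAB for independent $X$ and $Y$ with finite sample spaces. The PMF values and the function handle g below are assumptions made for the illustration; with g(x,y) = x + y the result can be compared with the convolution of the two PMFs.

```matlab
% Illustrative marginal PMFs (assumed values, not from the text)
xvals = 0:2;  pX = [0.5 0.3 0.2];
yvals = 0:3;  pY = [0.1 0.2 0.3 0.4];
g = @(x,y) x + y;                  % any function g(x,y) could be used here

% Collect the possible values z_k of Z = g(X,Y)
[Yg, Xg] = meshgrid(yvals, xvals);
zvals = unique(g(Xg(:), Yg(:)))';

pZ = zeros(size(zvals));
for i = 1:length(xvals)
    % Steps 1 and 2: fix X = x_i and find p_{Z|X}[z_k|x_i] by summing
    % p_Y[y_j] over all j for which g(x_i,y_j) = z_k
    pZgX = zeros(size(zvals));
    for k = 1:length(zvals)
        pZgX(k) = sum(pY(g(xvals(i), yvals) == zvals(k)));
    end
    % Step 3: uncondition by weighting with p_X[x_i] and accumulating
    pZ = pZ + pZgX*pX(i);
end

pZconv = conv(pX, pY);             % for g = x + y this should match pZ
```

Replacing the handle by, say, g = @(x,y) max(x,y) reuses the same loop for Example 8.2; only the comparison with conv is specific to the sum.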


In general, to compute probabilities of events it is advantageous to use a conditioning
argument, whether or not $X$ and $Y$ are independent. Where previously we
have used the formula

\[
P[Y \in A] = \sum_{\{j : y_j \in A\}} p_Y[y_j]
\]

to compute the probability, a conditioning approach would replace $p_Y[y_j]$ by
$\sum_i p_{Y|X}[y_j|x_i]\,p_X[x_i]$ to yield

\[
P[Y \in A] = \sum_{\{j : y_j \in A\}} \sum_i p_{Y|X}[y_j|x_i]\,p_X[x_i] \tag{8.26}
\]
to determine the probability. Equivalently, we have that

\[
P[Y \in A] = \underbrace{\sum_i \Bigl[\, \underbrace{\sum_{\{j : y_j \in A\}} p_{Y|X}[y_j|x_i]}_{\text{conditioning}} \,\Bigr] p_X[x_i]}_{\text{unconditioning}}. \tag{8.27}
\]

In this form we recognize the conditional probability of (8.23), which is

\[
P[Y \in A \mid X = x_i] = \sum_{\{j : y_j \in A\}} p_{Y|X}[y_j|x_i]
\]

and the unconditional probability

\[
P[Y \in A] = \sum_i P[Y \in A \mid X = x_i]\,p_X[x_i] \tag{8.28}
\]

with the latter being just a restatement of the law of total probability.

8.6  Mean  of  the  Conditional  PMF 

Since the conditional PMF is a PMF, it exhibits all the usual properties. In particular,
we can determine attributes such as the expected value of a random variable $Y$,
when it is known that $X = x_i$. This expected value is the mean of the conditional
PMF $p_{Y|X}$. Its definition is the usual one

\[
\sum_j y_j\, p_{Y|X}[y_j|x_i] \tag{8.29}
\]

where we have replaced $p_Y$ by $p_{Y|X}$. It should be emphasized that since the conditional
PMF depends on $x_i$, so will its mean. Hence, the mean of the conditional
PMF is a constant when we set $x_i$ equal to a fixed value. We adopt the notation for
the mean of the conditional PMF as $E_{Y|X}[Y|x_i]$. This notation includes the subscript
“$Y|X$” to remind us that the averaging PMF is the conditional PMF $p_{Y|X}$.
Also, the use of “$Y|x_i$” as the argument will remind us that the averaging PMF is
the conditional PMF that is specified by $X = x_i$ in the family of conditional PMFs.
The mean is therefore defined as

\[
E_{Y|X}[Y|x_i] = \sum_j y_j\, p_{Y|X}[y_j|x_i]. \tag{8.30}
\]

Although we have previously asserted that the mean is a constant, here it is to be
regarded as a function of $x_i$. An example of its calculation follows.
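Example 8.3 - Mean of conditional PMF - continuation of Example 8.1

We now compute all the possible values of $E_{Y|X}[Y|x_i]$ for the problem described
in Example 8.1. There $x_i = 1$ or $x_i = 0$ and the corresponding conditional PMFs
are given by (8.9) and (8.10), respectively. The means of the conditional PMFs are
therefore

\[
\begin{aligned}
E_{Y|X}[Y|0] &= \sum_j j\, p_{Y|X}[j|0]
  = 3\left(\tfrac{1}{9}\right) + 5\left(\tfrac{2}{9}\right) + 7\left(\tfrac{1}{3}\right) + 9\left(\tfrac{2}{9}\right) + 11\left(\tfrac{1}{9}\right) = 7 \\
E_{Y|X}[Y|1] &= \sum_j j\, p_{Y|X}[j|1]
  = 2\left(\tfrac{1}{18}\right) + 4\left(\tfrac{3}{18}\right) + 6\left(\tfrac{5}{18}\right) + 8\left(\tfrac{5}{18}\right) + 10\left(\tfrac{3}{18}\right) + 12\left(\tfrac{1}{18}\right) = 7
\end{aligned}
\]

and are shown in Figure 8.3. In this example the means of the conditional PMFs
are the same, but will not be in general. We can expect that $g(x_i) = E_{Y|X}[Y|x_i]$
will vary with $x_i$.

◊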

We  could  also  compute  the  variance  of  the  conditional  PMFs.  This  would  be 

\[
\mathrm{var}(Y|x_i) = \sum_j \left( y_j - E_{Y|X}[Y|x_i] \right)^2 p_{Y|X}[y_j|x_i]. \tag{8.31}
\]

The reader is asked to do this in Problem 8.22. (See also Problem 8.23 for an
alternate expression for $\mathrm{var}(Y|x_i)$.) Note from Figure 8.3 that we do not expect
these to be the same.
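
A minimal numerical sketch of (8.30) and (8.31) for the two-dice problem follows. The even-sum PMF is taken from (8.9); the odd-sum PMF is obtained by the same counting from Table 8.1. The variable names are ours.

```matlab
% Conditional PMFs of the sum of two dice
yEven = [2 4 6 8 10 12];  pEven = [1 3 5 5 3 1]/18;   % given the sum is even
yOdd  = [3 5 7 9 11];     pOdd  = [1 2 3 2 1]/9;      % given the sum is odd

% Means of the conditional PMFs, (8.30)
meanEven = sum(yEven.*pEven);
meanOdd  = sum(yOdd.*pOdd);

% Variances of the conditional PMFs, (8.31)
varEven = sum((yEven - meanEven).^2.*pEven);
varOdd  = sum((yOdd  - meanOdd ).^2.*pOdd);

fprintf('means: %g %g   variances: %g %g\n', meanEven, meanOdd, varEven, varOdd);
```

Both conditional means equal 7, while the conditional variances differ (19/3 for the even sums and 16/3 for the odd sums), in line with the remark above.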


What  is  the  “conditional  expectation”? 


The function $g(x_i) = E_{Y|X}[Y|x_i]$ is the mean of the conditional PMF $p_{Y|X}[y_j|x_i]$.
Alternatively, it is known as the conditional mean. This terminology is widespread
and so we will adhere to it, although we should keep in mind that it is meant to
denote the usual mean of the conditional PMF. It is also of interest to determine
the expectation of other quantities besides $Y$ with respect to the conditional PMF.
This is called the conditional expectation and is symbolized by $E_{Y|X}[g(Y)|x_i]$. The
latter is called the conditional expectation of $g(Y)$. For example, if $g(Y) = Y^2$,
then it becomes the conditional expectation of $Y^2$ or equivalently the conditional
second moment. Lastly, the reader should be aware that the conditional mean is the
optimal predictor of a random variable based on observation of a second random
variable (see Problem 8.27).


We  now  give  another  example  of  the  computation  of  the  conditional  mean. 


Example  8.4  -  Toss  one  of  two  dice. 

There  are  two  dice  having  different  numbers  of  dots  on  their  faces.  Die  1  is  the 
usual type of die with faces having 1, 2, 3, 4, 5, or 6 dots. Die 2 has been mislabeled
with its faces having 2, 3, 2, 3, 2, or 3 dots. A die is selected at random and tossed.
Each  face  of  the  die  is  equally  likely  to  occur.  What  is  the  expected  number  of  dots 
observed  for  the  tossed  die?  To  solve  this  problem  first  observe  that  the  outcomes 
will  depend  upon  which  die  has  been  tossed.  As  a  result,  the  conditional  expectation 
of  the  number  of  dots  will  depend  upon  which  die  is  initially  chosen.  We  can  view 
this  problem  as  a  conditional  one  by  letting 

\[
X = \begin{cases} 1 & \text{if die 1 is chosen} \\ 2 & \text{if die 2 is chosen} \end{cases}
\]

and $Y$ is the number of dots observed. Thus, we wish to determine $E_{Y|X}[Y|1]$ and
$E_{Y|X}[Y|2]$. But if die 1 is chosen, the conditional PMF is

\[
p_{Y|X}[j|1] = \frac{1}{6} \qquad j = 1, 2, 3, 4, 5, 6 \tag{8.32}
\]

and if die 2 is chosen

\[
p_{Y|X}[j|2] = \frac{1}{2} \qquad j = 2, 3. \tag{8.33}
\]

The latter conditional PMF is due to the fact that for die 2 half the sides show 2
dots and the other half of the sides show 3 dots. Using (8.30) with (8.32) and (8.33),
we have that

\[
\begin{aligned}
E_{Y|X}[Y|1] &= \sum_{j=1}^{6} j\, p_{Y|X}[j|1] = \frac{7}{2} \\
E_{Y|X}[Y|2] &= \sum_{j=2}^{3} j\, p_{Y|X}[j|2] = \frac{5}{2}.
\end{aligned} \tag{8.34}
\]

An  example  of  typical  outcomes  for  this  experiment  is  shown  in  Figure  8.7.  For 
50  trials  of  the  experiment  Figure  8.7a  displays  the  outcomes  for  which  die  1  was 
chosen  and  Figure  8.7b  displays  the  outcomes  for  which  die  2  was  chosen.  It  is 
interesting  to  note  that  the  estimated  mean  for  Figure  8.7a  is  3.88  and  for  Figure 
8.7b  it  is  2.58.  Note  that  from  (8.34)  the  theoretical  conditional  means  are  3.5  and 
2.5,  respectively. 

◊

In  the  previous  example,  we  have  determined  the  conditional  means,  which  are  the 
means  of  the  conditional  PMFs.  We  also  might  wish  to  determine  the  unconditional 
mean, which is the mean of $Y$. This is the number of dots observed as a result
of  the  overall  experiment,  without  first  conditioning  on  which  die  was  chosen.  In 
essence,  we  wish  to  determine  Ey[Y].  Intuitively,  this  is  the  average  number  of 
dots  observed  if  we  combined  Figures  8.7a  and  8.7b  together  (just  overlay  Figure 
8.7b  onto  Figure  8.7a)  and  continued  the  experiment  indefinitely.  Hence,  we  wish 
to  determine  Ey[Y]  for  the  following  experiment: 
Figure 8.7: Computer simulated outcomes of randomly selected die toss experiment.
(a) Outcomes when die 1 chosen. (b) Outcomes when die 2 chosen.

1.  Choose  die  1  or  die  2  with  probability  of  1/2. 

2.  Toss  the  chosen  die. 

3. Count the number of dots on the face of tossed die and call this the outcome of
   the random variable Y.

A  simple  MATLAB  program  to  simulate  this  experiment  is  given  as 

    for m=1:M
        if rand(1,1)<0.5
            % die 1 chosen: faces 1..6, each with probability 1/6
            y(m,1)=PMFdata(1,[1 2 3 4 5 6]',[1/6 1/6 1/6 1/6 1/6 1/6]');
        else
            % die 2 chosen: 2 or 3 dots, each with probability 1/2
            y(m,1)=PMFdata(1,[2 3]',[1/2 1/2]');
        end
    end

where the subprogram PMFdata.m is listed in Appendix 6B. After the code is executed
there is an array y, which is M x 1, containing M realizations of $Y$. By
taking the sample mean of the elements in the array y, we will have estimated
$E_Y[Y]$. But we expect about half of the realizations to have used the fair die and
the other half to have used the mislabeled die. As a result, we might suppose that the
unconditional mean is just the average of the two conditional means. This would be
(1/2)(7/2) + (1/2)(5/2) = 3, which turns out to be the true result. This conjecture
is also strengthened by the results of Figure 8.7. By overlaying the plots we have
50 outcomes of the experiment for which the sample mean is 3.25. Let's see how to
verify this.
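
For readers who do not have PMFdata.m at hand, the same experiment can be sketched with built-in functions only. This is a stand-in for illustration, not the listing from Appendix 6B; the face vectors die1 and die2 and the choice M = 100000 are our own assumptions.

```matlab
% Simulate the randomly selected die toss experiment without PMFdata.m
M = 100000;                        % number of trials (an arbitrary choice)
y = zeros(M,1);
die1 = [1 2 3 4 5 6];              % faces of the usual die
die2 = [2 3 2 3 2 3];              % faces of the mislabeled die
for m = 1:M
    face = ceil(6*rand(1,1));      % each of the six faces is equally likely
    if rand(1,1) < 0.5             % choose die 1 or die 2 with probability 1/2
        y(m) = die1(face);
    else
        y(m) = die2(face);
    end
end
sampleMean = mean(y);              % estimate of the unconditional mean
fprintf('sample mean = %6.4f\n', sampleMean);
```

With this many trials the sample mean should fall close to the conjectured value of 3, although the exact figure changes from run to run.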
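To determine the theoretical mean of $Y$, i.e., the unconditional mean, we will
need $p_Y[j]$. But given the conditional PMF and the marginal PMF we know from
Figure 8.4c that the joint PMF can be found. Hence, from (8.32) and (8.33) and
$p_X[i] = 1/2$ for $i = 1, 2$, we have

\[
p_{X,Y}[i,j] = p_{Y|X}[j|i]\,p_X[i] = \begin{cases}
\frac{1}{12} & i = 1;\ j = 1, 2, 3, 4, 5, 6 \\
\frac{1}{4} & i = 2;\ j = 2, 3.
\end{cases}
\]

To find $p_Y[j]$ we use

\[
p_Y[j] = \sum_{i=1}^{2} p_{X,Y}[i,j] = \begin{cases}
p_{X,Y}[1,j] = \frac{1}{12} & j = 1, 4, 5, 6 \\
p_{X,Y}[1,j] + p_{X,Y}[2,j] = \frac{1}{12} + \frac{1}{4} = \frac{1}{3} & j = 2, 3.
\end{cases}
\]

Thus, the unconditional mean becomes

\[
E_Y[Y] = \sum_{j=1}^{6} j\, p_Y[j]
= 1\left(\tfrac{1}{12}\right) + 2\left(\tfrac{1}{3}\right) + 3\left(\tfrac{1}{3}\right)
+ 4\left(\tfrac{1}{12}\right) + 5\left(\tfrac{1}{12}\right) + 6\left(\tfrac{1}{12}\right) = 3.
\]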

This  value  is  sometimes  called  the  unconditional  expectation.  Note  that  for  this 
example,  we  have  upon  using  (8.34) 

\[
E_Y[Y] = E_{Y|X}[Y|1]\,p_X[1] + E_{Y|X}[Y|2]\,p_X[2]
\]

or the unconditional mean is the average of the conditional means. This is true in
general and is summarized by the relationship

\[
E_Y[Y] = \sum_i E_{Y|X}[Y|x_i]\,p_X[x_i]. \tag{8.35}
\]

To  prove  this  relationship  is  straightforward.  Starting  with  (8.35)  we  have 
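\[
\begin{aligned}
\sum_i E_{Y|X}[Y|x_i]\,p_X[x_i]
&= \sum_i \Bigl( \sum_j y_j\, p_{Y|X}[y_j|x_i] \Bigr) p_X[x_i] && \text{(definition of conditional mean)} \\
&= \sum_i \sum_j y_j \frac{p_{X,Y}[x_i, y_j]}{p_X[x_i]}\, p_X[x_i] && \text{(definition of conditional PMF)} \\
&= \sum_j y_j \sum_i p_{X,Y}[x_i, y_j] \\
&= \sum_j y_j\, p_Y[y_j] && \text{(marginal PMF from joint PMF)} \\
&= E_Y[Y].
\end{aligned}
\]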
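
The relationship (8.35) can be checked numerically for the two-dice experiment of Example 8.4. In the sketch below the conditional PMFs (8.32) and (8.33) are stored as the rows of a matrix; the layout and variable names are our own.

```matlab
% Rows: die 1, die 2; columns: j = 1..6 dots
pX    = [1/2 1/2];                        % probability of choosing each die
pYgX  = [1 1 1 1 1 1; 0 3 3 0 0 0]/6;     % conditional PMFs (8.32), (8.33)
yvals = 1:6;

condMeans = pYgX * yvals';                % conditional means from (8.30)
EY_cond   = pX * condMeans;               % unconditional mean via (8.35)

pY        = pX * pYgX;                    % marginal PMF of Y from (8.20)
EY_direct = sum(yvals.*pY);               % unconditional mean computed directly
```

Both routes give the same unconditional mean, which is the content of the proof above: averaging the conditional means over $p_X$ is just a reordering of the double sum.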


In (8.35) we can consider $g(x_i) = E_{Y|X}[Y|x_i]$ as the transformed outcome of the
coin choice part of the experiment, where $X = x_i$ is the outcome of the coin choice.
Since before we choose the coin to toss, we do not know which one it will be, we
can consider $g(X)$ as a transformed random variable whose values are $g(x_i)$. By this
way of viewing things, we can define a random variable as $g(X) = E_{Y|X}[Y|X]$ and
therefore rewrite (8.35) as

\[
E_Y[Y] = E_X[g(X)]
\]

or explicitly we have that

\[
E_Y[Y] = E_X\bigl[ E_{Y|X}[Y|X] \bigr]. \tag{8.36}
\]

In  effect,  we  have  computed  the  expectation  of  a  random  variable  in  two  steps. 
Step  1  is  to  compute  a  conditional  expectation  EY\x  while  step  2  is  to  undo  the 
conditioning  by  averaging  the  result  with  respect  to  the  PMF  of  X.  An  example  is 
the  previous  coin  tossing  experiment.  The  utility  in  doing  so  is  that  the  conditional 
PMFs  were  easily  found  and  hence  also  the  means  of  the  conditional  PMFs,  and 
finally  the  averaging  with  respect  to  px  is  easily  carried  out  to  yield  the  desired 
result.  We  illustrate  the  use  of  (8.36)  with  another  example. 

Example  8.5  -  Random  number  of  coin  tosses 

An experiment is conducted in which a coin with a probability of heads p is tossed
M times. However, M is a random variable with $M \sim \mathrm{Pois}(\lambda)$. For example, if a
realization of M is generated, say M = 5, then the coin is tossed 5 times in succession.
We wish to determine the average number of heads observed. Conditionally
on knowing the value of M, we have a binomial PMF for the number of heads Y.
Hence, for M = i we have upon using the binomial PMF (see (5.6))

$$p_{Y|M}[j|i] = \binom{i}{j} p^j (1-p)^{i-j} \qquad j = 0, 1, \ldots, i;\ i = 0, 1, \ldots$$
Now using (8.36) and replacing X with M we have

$$E_Y[Y] = E_M[E_{Y|M}[Y|M]]$$

and for a binomial PMF we know that $E_{Y|M}[Y|i] = ip$ so that

$$E_Y[Y] = E_M[Mp] = pE_M[M].$$

But for a Poisson random variable $E_M[M] = \lambda$, which yields the final result

$$E_Y[Y] = \lambda p.$$

It can be shown more generally that $Y \sim \mathrm{Pois}(\lambda p)$ (see Problem 8.26) so that our
result for the mean of Y follows directly from knowledge of the mean of a Poisson
random variable.

❖
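The result $E_Y[Y] = \lambda p$ can also be checked by simulating the two-stage experiment. The MATLAB sketch below is only an illustration: the values of `lambda`, `p`, and `nreal` are arbitrary choices of ours, and the Poisson realization is generated with the standard product-of-uniforms method so that only base MATLAB functions are needed.

```matlab
% Monte Carlo sketch for Example 8.5 (lambda, p, and nreal are illustrative choices).
lambda = 4; p = 0.3; nreal = 20000;
y = zeros(nreal,1);
for n = 1:nreal
    % Step 1: generate M ~ Pois(lambda) by multiplying uniform random numbers
    % until the running product drops to exp(-lambda) or below.
    M = 0; prodU = rand(1,1);
    while prodU > exp(-lambda)
        M = M + 1;
        prodU = prodU*rand(1,1);
    end
    % Step 2: given M, toss the coin M times and count the number of heads.
    y(n) = sum(rand(M,1) < p);
end
meanY = mean(y);        % sample mean of Y, to be compared with lambda*p
```

For a large number of realizations the sample mean `meanY` should be close to `lambda*p`, although the exact value changes from run to run.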

8.7  Computer  Simulation  Based  on  Conditioning 

In Section 7.11 we discussed a simple method for generating realizations of jointly
distributed discrete random variables (X, Y) using MATLAB. To do so we required
the joint PMF. Using conditioning arguments, however, we can frequently simplify
the procedure. Since $p_{X,Y}[x_i,y_j] = p_{Y|X}[y_j|x_i]\,p_X[x_i]$, a realization of (X, Y) can
be obtained by first generating a realization of X according to its marginal PMF
$p_X[x_i]$. Then, assuming that $X = x_i$ is obtained, we next generate a realization of
Y according to the conditional PMF $p_{Y|X}[y_j|x_i]$. (Of course, if X and Y are independent,
we replace the second step by the generation of Y according to $p_Y[y_j]$ since
in this case $p_{Y|X}[y_j|x_i] = p_Y[y_j]$.) This is also advantageous when the problem
description is formulated in terms of conditional PMFs, as in a compound experiment.
To illustrate this approach and compare it with the one described previously we
repeat Example 7.15.

Example  8.6  —  Generating  realizations  of  jointly  distributed  random 
variables  -  Example  7.15  (continued) 

The  joint  PMF  of  Example  7.15  is  shown  in  Figure  8.8,  where  the  solid  circles 
represent  the  sample  points  and  the  values  of  the  joint  PMF  are  shown  to  the  right 
of the sample points. To use a conditioning approach we need to find $p_X$ and $p_{Y|X}$.
But from Figure 8.8, if we sum along the columns we obtain

$$p_X[i] = \begin{cases} \frac{1}{4} & i = 0 \\ \frac{3}{4} & i = 1 \end{cases}$$

and using the definition of the conditional PMF, we have

$$p_{Y|X}[j|0] = \frac{p_{X,Y}[0,j]}{p_X[0]} = \begin{cases} \frac{1/8}{1/4} = \frac{1}{2} & j = 0 \\ \frac{1/8}{1/4} = \frac{1}{2} & j = 1 \end{cases}$$

and

$$p_{Y|X}[j|1] = \frac{p_{X,Y}[1,j]}{p_X[1]} = \begin{cases} \frac{1/4}{3/4} = \frac{1}{3} & j = 0 \\ \frac{1/2}{3/4} = \frac{2}{3} & j = 1. \end{cases}$$

Figure 8.8: Joint PMF for Example 8.6.


The  MATLAB  segment  of  code  shown  below  generates  M  realizations  of  (X,Y) 
using  this  conditioning  approach. 


x=zeros(M,1); y=zeros(M,1);   % preallocate storage for the M realizations
for m=1:M
   ux=rand(1,1);               % uniform random number used to generate x
   uy=rand(1,1);               % uniform random number used to generate y given x
   if ux<=1/4                  % Refer to px[i]
      x(m,1)=0;
      if uy<=1/2               % Refer to py|x[j|0]
         y(m,1)=0;
      else
         y(m,1)=1;
      end
   else
      x(m,1)=1;                % Refer to px[i]
      if uy<=1/3               % Refer to py|x[j|1]
         y(m,1)=0;
      else
         y(m,1)=1;
      end
   end
end

The  reader  is  asked  to  test  this  program  in  Problem  8.29. 
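One possible way to carry out such a test is sketched below. It generates the realizations by conditioning, using $p_X[0] = 1/4$, $p_{Y|X}[0|0] = 1/2$, and $p_{Y|X}[0|1] = 1/3$, and then estimates the joint PMF by relative frequencies. The names `pXYest` and `pXYtrue` and the number of realizations are our own illustrative choices, not part of the original program.

```matlab
% Sketch: generate realizations by conditioning and estimate the joint PMF.
M = 10000;                         % number of realizations (illustrative)
x = zeros(M,1); y = zeros(M,1);
for m = 1:M
    ux = rand(1,1); uy = rand(1,1);
    if ux <= 1/4                   % X = 0 with probability p_X[0] = 1/4
        x(m) = 0;
        y(m) = (uy > 1/2);         % Y = 1 with probability p_{Y|X}[1|0] = 1/2
    else                           % X = 1 with probability p_X[1] = 3/4
        x(m) = 1;
        y(m) = (uy > 1/3);         % Y = 1 with probability p_{Y|X}[1|1] = 2/3
    end
end
pXYest = zeros(2,2);               % pXYest(i+1,j+1) estimates p_{X,Y}[i,j]
for i = 0:1
    for j = 0:1
        pXYest(i+1,j+1) = sum(x == i & y == j)/M;   % relative frequency
    end
end
pXYtrue = [1/8 1/8; 1/4 1/2];      % joint PMF values implied by the conditioning
```

As M increases, the entries of `pXYest` should approach those of `pXYtrue`.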

❖ 

8.8  Real-World  Example  -  Modeling  Human  Learning 

A 2-year-old child who has learned to walk can perform tasks that not even the
most sophisticated robots can match. For example, a 2-year-old child can easily
maneuver her way to a favorite toy, pick it up, and start to play with it. Robots,
powered by machine vision and mechanical grippers, have a hard time performing
this supposedly simple task. It is not surprising, therefore, that one of the holy
grails  in  cognitive  science  and  also  machine  learning  is  to  figure  out  how  a  child 
does  this.  If  we  were  able  to  understand  the  thought  processes  that  were  used 
to  successfully  complete  this  task,  then  it  is  conceivable  that  a  machine  might  be 
built  to  do  the  same  thing.  Many  models  of  human  learning  employ  a  Bayesian 
framework  [Tenenbaum  1999].  This  approach  appears  to  be  fruitful  in  that  using 
Bayesian  modeling  we  are  able  to  discriminate  with  more  and  more  accuracy  as 
we  repeatedly  perform  an  experiment  and  observe  the  outcome.  This  is  analogous 
to  a  child  attempting  to  pick  up  the  toy,  dropping  it,  picking  it  up  again  after 
having  learned  something  about  how  to  pick  it  up,  dropping  it,  etc.,  until  finally 
she  is  successful.  Each  time  the  experiment,  attempting  to  pick  up  the  toy,  is 
repeated  the  child  learns  something  or  equivalently  narrows  down  the  number  of 
possible  strategies.  In  Bayesian  analysis,  as  we  will  show  next,  the  width  of  the 
PMF  decreases  as  we  observe  more  outcomes.  This  is  in  some  sense  saying  that 
our  uncertainty  about  the  outcome  of  the  experiment  decreases  as  it  is  performed 
more  times.  Although  not  a  perfect  analogy,  it  does  seem  to  possess  some  critical 
elements  of  the  human  learning  process.  Therefore,  we  illustrate  this  modeling  with 
the  simple  example  of  coin  tossing. 

Suppose  we  wish  to  “learn”  whether  a  coin  is  fair  (p  =  1/2)  or  is  weighted 
($p \neq 1/2$). One way to do this is to repeatedly toss the coin and count the number
of  heads  observed.  We  would  expect  that  our  certainty  about  the  conclusion,  that 
the  coin  is  fair  or  not,  would  increase  as  the  number  of  trials  increases.  In  the 
Bayesian  model  we  quantify  our  knowledge  about  the  value  of  p  by  assuming  that 
p  is  a  random  variable.  Our  particular  coin,  however,  has  a  fixed  probability  of 
heads.  It  is  just  that  we  do  not  know  what  it  is  and  hence  our  belief  about  the  value 
of p is embodied in an assumed PMF. This is a slightly different interpretation of
probability than our previous relative frequency interpretation. To conform to our
previous notation we let the probability of heads be denoted by the random variable
Y and its values by $y_j$. Then, we determine its PMF. Our state of knowledge will be
high if the PMF is highly concentrated about a particular value, as for example in
Figure 8.9a. If, however, the PMF is spread out or “diffuse”, our state of knowledge
will be low, as for example in Figure 8.9b.

Figure 8.9: PMFs reflecting state of knowledge about coin’s probability of heads.
(a) Y = probability of heads, state of knowledge is high. (b) Y = probability of
heads, state of knowledge is low. The horizontal axis is $y_j$.

Now let’s say that we wish to learn the value of the probability of heads. Before
we toss the coin we have no idea what it
is, and therefore it is reasonable to assume a PMF that is uniform, as, for example,
the one shown in Figure 8.9b. Such a PMF is given by

$$p_Y[y_j] = \frac{1}{M+1} \qquad \text{for } y_j = 0, \frac{1}{M}, \frac{2}{M}, \ldots, \frac{M-1}{M}, 1 \qquad (8.37)$$

for some large M (in Figure 8.9b M = 11). This is also called the prior PMF since
it summarizes our state of knowledge before the experiment is performed. Now
we begin to toss the coin and examine our state of knowledge as the number of
tosses increases. Let N be the number of coin tosses and X denote the number of
heads observed in the N tosses. We know that the number of heads
is binomially distributed. However, to specify the PMF completely, we require
knowledge of the probability of heads. Since this is unknown, we can only specify
the PMF of X conditionally; that is, if $Y = y_j$ is the probability of heads, then the
conditional PMF of the number of heads for X = i is

$$p_{X|Y}[i|y_j] = \binom{N}{i} y_j^i (1 - y_j)^{N-i} \qquad i = 0, 1, \ldots, N. \qquad (8.38)$$
Since we are actually interested in the probability of heads or the PMF of Y after
observing the outcomes of N coin tosses, we need to determine the conditional PMF
$p_{Y|X}[y_j|i]$. The latter is also called the posterior PMF, since it is to be determined
after the experiment is performed. The reader may wish to compare this terminology
with that used in Chapter 4. The posterior PMF contains all the information about
the probability of heads that results from our prior knowledge, summarized by $p_Y$,
and our “data” knowledge, summarized by $p_{X|Y}$. The posterior PMF is given by
Bayes’ rule (8.15) with $x_i = i$ as

$$p_{Y|X}[y_j|i] = \frac{p_{X|Y}[i|y_j]\,p_Y[y_j]}{\sum_{j=0}^{M} p_{X|Y}[i|y_j]\,p_Y[y_j]}.$$

Using (8.37) and (8.38) we have

$$p_{Y|X}[y_j|i] = \frac{\binom{N}{i} y_j^i (1-y_j)^{N-i}\,\frac{1}{M+1}}{\sum_{j=0}^{M} \binom{N}{i} y_j^i (1-y_j)^{N-i}\,\frac{1}{M+1}} \qquad y_j = 0, 1/M, \ldots, 1;\ i = 0, 1, \ldots, N$$

or finally,

$$p_{Y|X}[y_j|i] = \frac{y_j^i (1-y_j)^{N-i}}{\sum_{j=0}^{M} y_j^i (1-y_j)^{N-i}} \qquad y_j = 0, 1/M, \ldots, 1;\ i = 0, 1, \ldots, N. \qquad (8.39)$$


Note that the posterior PMF depends on the number of heads observed, which is
i. To understand what this PMF is saying about our state of knowledge, assume
that we toss the coin N = 10 times and observe i = 4 heads. The posterior PMF
is shown in Figure 8.10a. For N = 20, i = 11 and N = 40, i = 19, the posterior
PMFs are shown in Figures 8.10b and 8.10c, respectively. Note that as the number
of tosses increases the posterior PMF becomes narrower and centered about the
value of 0.5. The Bayesian model has “learned” the value of p, with our confidence
increasing as the number of trials increases. Note that for no trials (just set N = 0
and hence i = 0 in (8.39)) we have just the uniform prior PMF of Figure 8.9b.
From our experiments we could now conclude with some certainty that the coin
is fair. However, if the outcomes were N = 10, i = 2, and N = 20, i = 5, and
N = 40, i = 7, then the posterior PMFs would appear as in Figure 8.11. We would
then conclude that the coin is weighted and is biased against yielding a head, since
the posterior PMF is concentrated about 0.2. See [Kay 1993] for further descriptions
of Bayesian approaches to estimation.

Figure 8.10: Posterior PMFs for coin tossing analogy to human learning - coin
appears to be fair. The $y_j$'s are possible probability values for a head.
(a) N = 10, i = 4. (b) N = 20, i = 11. (c) N = 40, i = 19.

Figure 8.11: Posterior PMFs for coin tossing analogy to human learning - coin
appears to be weighted. The $y_j$'s are possible probability values for a head.
(a) N = 10, i = 2. (b) N = 20, i = 5. (c) N = 40, i = 7.
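The posterior PMF of (8.39) is easily evaluated on a computer. The MATLAB sketch below does so for the three cases N = 10, i = 4; N = 20, i = 11; and N = 40, i = 19 discussed above, with M = 11. The variable names (`yj`, `heads`, `post`, and so on) are our own and the code is meant only as an illustration of (8.39), not as the program used to produce the figures.

```matlab
% Sketch: evaluate the posterior PMF (8.39) on the grid y_j = j/M, j = 0,1,...,M.
M = 11;                               % grid parameter, as in Figure 8.9b
yj = (0:M)'/M;                        % possible values of the probability of heads
N = [10 20 40];                       % numbers of tosses
heads = [4 11 19];                    % numbers of heads observed (the values of i)
post = zeros(M+1,length(N));          % one posterior PMF per column
for n = 1:length(N)
    w = (yj.^heads(n)).*((1-yj).^(N(n)-heads(n)));  % numerator of (8.39)
    post(:,n) = w/sum(w);                           % divide by the sum over j
end
[pmax,jmax] = max(post);              % peak value and location of each posterior PMF
ypeak = yj(jmax);                     % value of y_j at which each posterior PMF peaks
% stem(yj,post(:,3)) plots the posterior PMF for N = 40, i = 19.
```

Each column of `post` sums to one, and the peak value in `pmax` grows with N as the posterior PMF narrows. Setting N = 0 and i = 0 gives `w` equal to all ones (MATLAB evaluates 0^0 as 1), so that the uniform prior PMF is recovered.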
Problems

8.1 (w) A fair coin is tossed. If it comes up heads, then X = 1 and if it comes
up tails, then X = 0. Next, a point is selected at random from the area A
if X = 1 and from the area B if X = 0 as shown in Figure 8.12. Note that
the area of the square is 4 and A and B both have areas of 3/2. If the point
selected is in an upper quadrant, we set Y = 1 and if it is in a lower quadrant,
we set Y = 0. Find the conditional PMF $p_{Y|X}[j|i]$ for all values of i and j.
Next, compute P[Y = 0].


Figure  8.12:  Areas  for  Problem  8.1. 


8.2 (☺) (w) A fair coin is tossed with the outcome mapped into X = 1 for a head
and X = 0 for a tail. If it comes up heads, then a fair die is tossed. The
outcome of the die is denoted by Y and is set equal to the number of dots
observed. If the coin comes up tails, then we set Y = 0. Find the conditional
PMF $p_{Y|X}[j|i]$ for all values of i and j. Next, compute P[Y = 1].

8.3 (w) A fair coin is tossed 3 times in succession. All the outcomes (i.e., the
3-tuples) are equally likely. The random variables X and Y are defined as

$$X = \begin{cases} 0 & \text{if outcome of first toss is a tail} \\ 1 & \text{if outcome of first toss is a head} \end{cases}$$

Y = number of heads observed for the three tosses.

Determine the conditional PMF $p_{Y|X}[j|i]$ for all i and j.

8.4 (t) Prove that $\sum_{j=-\infty}^{\infty} p_{Y|X}[y_j|x_i] = 1$ for all $x_i$.

8.5 (☺) (w) Are the following functions valid conditional PMFs

a. $p_{Y|X}[j|x_i] = (1 - x_i)^j x_i$, $j = 1, 2, \ldots$; $x_i = 1/4, 1/2, 3/4$
b. $p_{Y|X}[j|x_i] = \binom{N}{j} x_i^j (1 - x_i)^{N-j}$, $j = 0, 1, \ldots, N$; $x_i = -1/2, 1/2$
c. $p_{Y|X}[j|x_i] = c\,x_i^j$, $j = 2, 3, \ldots$; $x_i = 2$ for c some constant?


8.6 (☺) (f) If

$$p_{X,Y}[i,j] = \begin{cases} \frac{1}{6} & i = 0, j = 0 \\ \frac{1}{3} & i = 0, j = 1 \\ \frac{1}{3} & i = 1, j = 0 \\ \frac{1}{6} & i = 1, j = 1 \end{cases}$$

find $p_{Y|X}$ and $p_{X|Y}$.


8.7  (f)  Verify  the  conditional  PMF  given  in  (8.10). 

8.8 (☺) (f) For the sample space shown in Figure 8.1 determine $p_{Y|X}$ and $p_{X|Y}$ if
all the outcomes are equally likely. Explain your results.

8.9 (w) Explain the need for the denominator term in (8.11) and (8.12).

8.10 (w) If $p_{Y|X}$ and $p_Y$ are known, can you find $p_{X,Y}$?

8.11 (☺) (w) A box contains three types of replacement light bulbs. There is an
equal proportion of each type. The types vary in their quality so that the
probability that the light bulb fails at the jth use is given by

$$\begin{aligned} p_{Y|X}[j|1] &= (0.99)^{j-1}\,0.01 \\ p_{Y|X}[j|2] &= (0.9)^{j-1}\,0.1 \\ p_{Y|X}[j|3] &= (0.8)^{j-1}\,0.2 \end{aligned}$$

for $j = 1, 2, \ldots$. Note that $p_{Y|X}[j|i]$ is the PMF of the bulb failing at the jth
use if it is of type i. If a bulb is selected at random from the box, what is the
probability that it will operate satisfactorily for at least 10 uses?

8.12 (f) A joint PMF $p_{X,Y}[i,j]$ has the values shown in Table 8.2. Determine the
conditional PMF $p_{Y|X}$. Are the random variables independent?

           j = 1    j = 2    j = 3
  i = 1    1/10     1/10     2/10
  i = 2    1/20     1/20     1/10
  i = 3    3/10     1/20     1/20

Table 8.2: Joint PMF for Problem 8.12.


8.13 (☺) (w) A random vector (X, Y) has a sample space shown in Figure 8.13
with the sample points depicted as solid circles. The four points are equally
probable. Note that the points in Figure 8.13b are the corners of the square
shown in Figure 8.13a after rotation by +45°. For both cases compute $p_{Y|X}$
and $p_Y$ to determine if the random variables are independent.

8.14 (t) Use the properties of conditional probability and the definition of the
conditional PMF to prove (8.23). Hint: Let $A = \bigcup_j \{s : Y(s) = y_j\}$ and note that
the events $\{s : Y(s) = y_j\}$ are mutually exclusive.

Figure 8.13: Joint PMFs - each point is equally probable.


8.15 (w) If X and Y are independent random variables, find the PMF of Z =
|X − Y|. Assume that $S_X = \{0, 1, \ldots\}$ and $S_Y = \{0, 1, \ldots\}$. Hint: The answer
is

$$p_Z[k] = \begin{cases} \sum_{i=0}^{\infty} p_X[i]\,p_Y[i] & k = 0 \\ \sum_{i=0}^{\infty} \left(p_Y[i]\,p_X[i+k] + p_X[i]\,p_Y[i+k]\right) & k = 1, 2, \ldots \end{cases}$$

As an intermediate step show that

$$p_{Z|X}[k|i] = \begin{cases} p_Y[i] & k = 0 \\ p_Y[i+k] + p_Y[i-k] & k \neq 0. \end{cases}$$


8.16 (w) Two people agree to meet at a specified time. Person A will be late by
i minutes with a probability $p_X[i] = (1/2)^{i+1}$ for $i = 0, 1, \ldots$, while person B
will be late by j minutes with a probability of $p_Y[j] = (1/2)^{j+1}$ for $j = 0, 1, \ldots$.
The persons arrive independently of each other. The first person to arrive will
wait a maximum of 2 minutes for the second person to arrive. If the second
person is more than 2 minutes late, the first person will leave. What is the
probability that the two people will meet? Hint: Use the results of Problem
8.15.

8.17 (☺) (w) If X and Y are independent random variables, both of whose PMFs
take on values {0, 1, ...}, find the PMF of Z = min(X, Y).

8.18 (w) If X and Y have the joint PMF

$$p_{X,Y}[i,j] = p_1 p_2 (1 - p_1)^i (1 - p_2)^j \qquad i = 0, 1, \ldots;\ j = 0, 1, \ldots$$

where $0 < p_1 < 1$, $0 < p_2 < 1$, find P[Y > X] using a conditioning argument.
In particular, make use of (8.23) and P[Y > X|X = i] = P[Y > i|X = i].
8.19 (f) If X and Y have the joint PMF given in Problem 8.6, find $E_{Y|X}[Y|x_i]$.

8.20  (f )  If  X  and  Y  have  the  joint  PMF 

Xj 

exp(-A)  —  i  =  0, 1, . . . ;  j  =  0, 1, . . . 

3- 

find $E_{Y|X}[Y|i]$ for all i.

8.21 (☺) (f) Find the conditional mean of Y given X if the joint PMF is uniformly
distributed over the points $S_{X,Y} = \{(0,0), (1,0), (1,1), (2,0), (2,1), (2,2)\}$.

8.22 (☺) (f) For the joint PMF given in Problem 8.21 determine $\mathrm{var}(Y|x_i)$ for all
$x_i$. Explain why your results appear to be reasonable.

8.23 (t) Prove that $\mathrm{var}(Y|x_i) = E_{Y|X}[Y^2|x_i] - E_{Y|X}^2[Y|x_i]$ by using (8.31).

8.24 (f) Find $E_Y[Y]$ for the joint PMF given in Problem 8.21. Do this by using
the definition of the expected value and also by using (8.36).

8.25 (t) Prove the extension of (8.36) which is

$$E_Y[g(Y)] = E_X\left[E_{Y|X}[g(Y)|X]\right]$$

where $h(X) = E_{Y|X}[g(Y)|X]$ is a function of the random variable X which
takes on values

$$h(x_i) = E_{Y|X}[g(Y)|x_i] = \sum_j g(y_j)\,p_{Y|X}[y_j|x_i].$$

This says that $E_Y[g(Y)]$ can be computed using the formula

$$E_Y[g(Y)] = \sum_i \left[\sum_j g(y_j)\,p_{Y|X}[y_j|x_i]\right] p_X[x_i].$$


8.26 (t) In this problem we prove that if $M \sim \mathrm{Pois}(\lambda)$ and Y conditioned on M
has a binomial PMF with parameter p, then the unconditional PMF of Y is
$\mathrm{Pois}(\lambda p)$. This means that if

$$p_M[m] = \exp(-\lambda)\frac{\lambda^m}{m!} \qquad m = 0, 1, \ldots$$

and

$$p_{Y|M}[j|m] = \binom{m}{j} p^j (1 - p)^{m-j} \qquad j = 0, 1, \ldots, m$$

then

$$p_Y[j] = \exp(-\lambda p)\frac{(\lambda p)^j}{j!} \qquad j = 0, 1, \ldots.$$

To prove this you will need to derive the characteristic function of Y and show
that it corresponds to a $\mathrm{Pois}(\lambda p)$ random variable. Proceed as follows, making
use of the results of Problem 8.25

$$\begin{aligned} \phi_Y(\omega) &= E_Y[\exp(j\omega Y)] \\ &= E_M\left[E_{Y|M}[\exp(j\omega Y)|M]\right] \\ &= E_M\left[\left[p\exp(j\omega) + (1 - p)\right]^M\right] \end{aligned}$$

and complete the derivation.

8.27 (t) In Chapter 7 the optimal linear predictor of Y based on $X = x_i$ was found.
The criterion of optimality was the minimum mean square error, where the
mean square error was defined as $E_{X,Y}[(Y - (aX + b))^2]$. In this problem we
prove that the best predictor, now allowing for nonlinear predictors as well, is
given by the conditional mean $E_{Y|X}[Y|x_i]$. To prove this we let the predictor
be $\hat{Y} = g(X)$ and minimize

$$\begin{aligned} E_{X,Y}[(Y - g(X))^2] &= \sum_i \sum_j (y_j - g(x_i))^2\,p_{X,Y}[x_i,y_j] \\ &= \sum_i \left[\sum_j (y_j - g(x_i))^2\,p_{Y|X}[y_j|x_i]\right] p_X[x_i]. \end{aligned}$$

But since $p_X[x_i]$ is nonnegative and we can choose a different value of $g(x_i)$
for each $x_i$, we can equivalently minimize

$$\sum_j (y_j - g(x_i))^2\,p_{Y|X}[y_j|x_i]$$

where we consider $g(x_i) = c$ as a constant. Prove that this is minimized for
$g(x_i) = E_{Y|X}[Y|x_i]$. Hint: You may wish to review Section 6.6.

8.28 (☺) (f) For random variables X and Y with the joint PMF


PX,Y[hl  = 


(i,j)  =  (- 1,0) 
(t,i)  =  (0,-l) 
=  {  0,1) 
(*,J)  =  (1,0) 


we  wish  to  predict  Y  based  on  our  knowledge  of  the  outcome  of  X.  Find  the 
optimal  predictor  using  the  results  of  Problem  8.27.  Also,  find  the  optimal 
linear  predictor  for  this  problem  (see  Section  7.9)  and  compare  your  results. 
Draw  a  picture  of  the  sample  space  using  solid  circles  to  indicate  the  sample 
points in a plane and then plot the prediction for each outcome of X = i for
i = −1, 0, 1. Explain your results.

8.29 (c) Test out the MATLAB program given in Section 8.7 to generate realizations
of the vector random variable (X, Y) whose joint PMF is given in Figure
8.8. Do so by estimating the joint PMF or $p_{X,Y}[i,j]$. You may wish to review
Section 7.11.

8.30 (☺) (w,c) For the joint PMF given in Figure 8.8 determine the conditional
mean $E_{Y|X}[Y|i]$ and then verify your results using a computer simulation. Note
that you will have to separate the realizations $(x_m, y_m)$ into two sets, one in
which $x_m = 0$ and one in which $x_m = 1$, and then use the sample average of
each set as your estimator.

8.31 (w,c) For the joint PMF given in Figure 8.8 determine $E_Y[Y]$. Then, verify
(8.36) by using your results from Problem 8.30, and computing

$$\hat{E}_Y[Y] = \hat{E}_{Y|X}[Y|0]\,\hat{p}_X[0] + \hat{E}_{Y|X}[Y|1]\,\hat{p}_X[1]$$

where $\hat{E}_{Y|X}[Y|0]$ and $\hat{E}_{Y|X}[Y|1]$ are the values obtained in Problem 8.30. Also,
the PMF of X, which needs to be estimated, can be estimated as described in
Section 5.9.

8.32 (w,c) For the posterior PMF given by (8.39) plot the PMF for i = N/2,
M = 11 and increasing N, say N = 10, 30, 50, 70. What happens as N becomes
large? Explain your results. Hint: You will need a computer to evaluate and
plot the posterior PMF.


Chapter  9 


Discrete N-Dimensional
Random Variables

9.1  Introduction 

In this chapter we extend the results of Chapters 5-8 to N-dimensional random
variables, which are represented as an N × 1 random vector. Hence, our discussions will
apply to the 2 × 1 random vector previously studied. In fact, most of the concepts
introduced earlier are trivially extended so that we do not dwell on the
conceptualization. The only exception is the introduction of the covariance matrix, which
we have not seen before. We will introduce more general notation in combination
with vector/matrix representations to allow the convenient manipulation of N × 1
random vectors. This representation allows many results to be easily derived and is
useful for the more advanced theory of probability that the reader may encounter
later. Also, it lends itself to straightforward computer implementations, particularly
if one uses MATLAB, which is a vector-based programming language. Since many
of the methods and subsequent properties rely on linear and matrix algebra, a brief
summary of relevant concepts is given in Appendix C.


9.2  Summary 

The N-dimensional joint PMF is given by (9.1) and satisfies the usual properties of
(9.3) and (9.4). The joint PMF of any subset of the N random variables is obtained
by summing the joint PMF over the undesired ones. If the joint PMF factors as
in (9.7), the random variables are independent and vice versa. The joint PMF
of a transformed random vector is given by (9.9). In particular, if the transformed
random variable is the sum of N independent random variables with the same PMF,
then the PMF is most easily found from (9.14). The expected value of a random
vector is defined by (9.15) and the expected value of a scalar function of a random
vector is found via (9.16). As usual, the expectation operator is linear with a special
case given by (9.17). The variance of a sum of N random variables is given by (9.20)
or (9.21). If the random variables are uncorrelated, then this variance is the sum of
the variances as per (9.22). The covariance matrix of a random vector is defined by
(9.25). It has many important properties that are summarized in Properties 9.1-9.5.
Particularly useful results are the covariance matrix of a linearly transformed
random vector given by (9.27) and the ability to decorrelate the elements of a
random vector using a linear transformation as explained in the proof of Property
9.5. An example of this procedure is given in Example 9.4. The joint moments and
characteristic function of an N-dimensional PMF are defined by (9.32) and (9.34),
respectively. The joint moments are obtainable from the characteristic function by
using (9.36). An important relationship is the factorization of the joint PMF into
a product of conditional PMFs as given by (9.39). When the random variables
exhibit the Markov property, then this factorization simplifies even further into the
product of first-order conditional PMFs as given by (9.41). The estimates of the
mean vector and the covariance matrix of a random vector are given by (9.44) and
(9.46), respectively. Some MATLAB code for implementing these estimates is listed
in Section 9.8. Finally, a real-world example of the use of transform coding to
store/transmit image data is described in Section 9.9. It is based on decorrelation
of random vectors and so makes direct use of the properties of the covariance matrix.


9.3  Random  Vectors  and  Probability  Mass  Functions 


Previously, we denoted a two-dimensional random vector by either of the equivalent
notations $(X, Y)$ or $[X\ Y]^T$. Since we now wish to extend our results to an N × 1
random vector, we shall use $(X_1, X_2, \ldots, X_N)$ or $\mathbf{X} = [X_1\ X_2 \ldots X_N]^T$. Note that
a boldface character will always denote a vector or a matrix, in contrast to a scalar
variable. Also, all vectors are assumed to be column vectors. A random vector
is defined as a mapping from the original sample space S of the experiment to a
numerical sample space, which we term $S_{X_1,X_2,\ldots,X_N}$. The latter is normally referred
to as $R^N$, which is the N-dimensional Euclidean space. Hence, $\mathbf{X}$ takes on values
in $R^N$ so that

$$\mathbf{X}(s) = \begin{bmatrix} X_1(s) \\ X_2(s) \\ \vdots \\ X_N(s) \end{bmatrix}$$

will have values

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$$

where $\mathbf{x}$ is a point in the N-dimensional Euclidean space $R^N$. A simple example is
S = {all lottery tickets} with $\mathbf{X}(s)$ representing the number printed on the ticket.
Then, $X_1(s)$ is the first digit of the number, $X_2(s)$ is the second digit of the number,
..., and $X_N(s)$ is the Nth digit of the number.

We are, as usual, interested in the probability that $\mathbf{X}$ takes on its possible values.
This probability is $P[X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N]$ and it is defined as the joint
PMF. The joint PMF is therefore defined as

$$p_{X_1,X_2,\ldots,X_N}[x_1, x_2, \ldots, x_N] = P[X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N] \qquad (9.1)$$

or more succinctly using vector notation as

$$p_{\mathbf{X}}[\mathbf{x}] = P[\mathbf{X} = \mathbf{x}].$$

When $\mathbf{x}$ consists of integer values only, we will replace $x_i$ by $k_i$. Then, the joint
PMF will be $p_{X_1,X_2,\ldots,X_N}[k_1, k_2, \ldots, k_N]$ or more succinctly $p_{\mathbf{X}}[\mathbf{k}]$, where
$\mathbf{k} = [k_1\ k_2 \ldots k_N]^T$. An example of an N-dimensional joint PMF, which is of
considerable importance, is the multinomial PMF (see (4.19)). In our new notation the
joint PMF is

$$p_{X_1,X_2,\ldots,X_N}[k_1, k_2, \ldots, k_N] = \binom{M}{k_1, k_2, \ldots, k_N} p_1^{k_1} p_2^{k_2} \cdots p_N^{k_N}$$

where $k_i \geq 0$ with $\sum_{i=1}^{N} k_i = M$, and $0 \leq p_i \leq 1$ for all i with $\sum_{i=1}^{N} p_i = 1$. That
this is a valid joint PMF follows from its adherence to the usual properties

$$0 \leq p_{X_1,X_2,\ldots,X_N}[k_1, k_2, \ldots, k_N] \leq 1 \qquad (9.3)$$

$$\sum_{k_1} \sum_{k_2} \cdots \sum_{k_N} p_{X_1,X_2,\ldots,X_N}[k_1, k_2, \ldots, k_N] = 1. \qquad (9.4)$$

To prove (9.4) we need only use the multinomial expansion, which is (see Problem
9.3)

$$(a_1 + a_2 + \cdots + a_N)^M = \sum_{k_1} \sum_{k_2} \cdots \sum_{k_N} \binom{M}{k_1, k_2, \ldots, k_N} a_1^{k_1} a_2^{k_2} \cdots a_N^{k_N}$$

where $\sum_{i=1}^{N} k_i = M$.
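Property (9.4) can be confirmed numerically for a small case. The MATLAB sketch below sums the multinomial PMF over all $(k_1, k_2, k_3)$ with $k_1 + k_2 + k_3 = M$ for N = 3; the values of `M` and `p` are arbitrary choices of ours made only for illustration.

```matlab
% Sketch: check property (9.4) for a multinomial PMF with N = 3.
M = 5;                       % number of trials (illustrative)
p = [0.2 0.3 0.5];           % probabilities p_1, p_2, p_3, which sum to one
total = 0;
for k1 = 0:M
    for k2 = 0:(M-k1)
        k3 = M - k1 - k2;    % the k_i's are constrained to sum to M
        % multinomial coefficient M!/(k1! k2! k3!)
        coef = factorial(M)/(factorial(k1)*factorial(k2)*factorial(k3));
        total = total + coef*p(1)^k1*p(2)^k2*p(3)^k3;
    end
end
```

The variable `total` should equal one to within rounding error, in agreement with the multinomial expansion evaluated at $a_i = p_i$.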

The marginal PMFs are obtained from the joint PMF by summing over the other
variables. For example, if $p_{X_1}[x_1]$ is desired, then

$$p_{X_1}[x_1] = \sum_{\{x_2 : x_2 \in S_{X_2}\}} \sum_{\{x_3 : x_3 \in S_{X_3}\}} \cdots \sum_{\{x_N : x_N \in S_{X_N}\}} p_{X_1,X_2,\ldots,X_N}[x_1, x_2, \ldots, x_N] \qquad (9.6)$$

and similarly for the other N − 1 marginals. This is because the right-hand side of
(9.6) is

$$P[X_1 = x_1, X_2 \in S_{X_2}, X_3 \in S_{X_3}, \ldots, X_N \in S_{X_N}] = P[X_1 = x_1].$$
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When  the  random  vector  is  composed  of  more  than  two  random  variables,  we  can 
also  obtain  the  joint  PMF  of  any  subset  of  the  random  variables.  We  do  this  by 
summing  over  the  variables  that  we  wish  to  eliminate.  If,  say,  we  wish  to  determine 
the  joint  PMF  of  X\  and  X/v,  we  have 

PXi,Xn[xi,XN]  =  EE -E  PX  1,X2,...,XN  [®1,  X2,  ■  •  •  ,  Xn]- 

X2  Xs  Xpf-i 

As  in  the  case  of  N  =  2  the  marginal  PMFs  do  not  determine  the  joint  PMF, 
unless  of  course  the  random  variables  are  independent.  In  the  iV-dimensional  case 
the  random  variables  are  defined  to  be  independent  if  the  joint  PMF  factors  or  if 

Pxux2,...,xn[xux2i  ---,xn]  =  Px1[xi]px2[x2]  •  •  -Pxn[xn]-  (9.7) 

Hence,  if  (9.7)  holds,  the  random  variables  are  independent,  and  if  the  random 
variables  are  independent  (9.7)  holds.  Unlike  the  case  of  N  =  2,  it  is  possible  that 
the  joint  PMF  may  factor  into  two  or  more  joint  PMFs.  Then,  the  subsets  of  random 
variables are said to be independent of each other. For example, if $N = 4$ and the
joint PMF factors as $p_{X_1,X_2,X_3,X_4}[x_1,x_2,x_3,x_4] = p_{X_1,X_2}[x_1,x_2]\, p_{X_3,X_4}[x_3,x_4]$, then
the random variables $(X_1, X_2)$ are independent of the random variables $(X_3, X_4)$.
An  example  of  the  determination  of  a  joint  PMF  follows. 

Example  9.1  —  Joint  PMF  for  independent  Bernoulli  trials 

Consider  an  experiment  in  which  we  toss  a  coin  with  a  probability  of  heads  p, 
$N$ times in succession. We let $X_i = 1$ if the $i$th outcome is a head and $X_i = 0$
if it is a tail. Furthermore, assume that the trials are independent. As defined
in Chapter 4, this means that the probability of the outcome on any trial is not
affected by the outcomes of any of the other trials. Thus, the experiment is a
sequence of independent Bernoulli trials. The sample space is $N$-dimensional and
is given by $S_{X_1,X_2,\ldots,X_N} = \{(k_1,k_2,\ldots,k_N) : k_i = 0,1;\ i = 1,2,\ldots,N\}$, and since
$p_{X_i}[k_i] = p^{k_i}(1-p)^{1-k_i}$, we have the joint PMF from (9.7)

\begin{align*}
p_{X_1,X_2,\ldots,X_N}[k_1,k_2,\ldots,k_N] &= \prod_{i=1}^{N} p_{X_i}[k_i] \\
&= \prod_{i=1}^{N} p^{k_i}(1-p)^{1-k_i} \\
&= p^{\sum_{i=1}^{N} k_i}(1-p)^{N-\sum_{i=1}^{N} k_i}. \tag{9.8}
\end{align*}

❖ 
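As a quick numerical illustration of (9.8), the joint PMF can be evaluated over every binary $N$-tuple and summed. This is only a sketch; the function name `bernoulli_joint_pmf` and the values $N = 4$, $p = 0.3$ are assumptions, not part of the example.

```python
from itertools import product
from math import isclose

def bernoulli_joint_pmf(ks, p):
    """Joint PMF (9.8) of independent Bernoulli trials for the outcome tuple ks."""
    s = sum(ks)                          # number of heads in the tuple
    return p ** s * (1 - p) ** (len(ks) - s)

N, p = 4, 0.3                            # illustrative values
outcomes = list(product((0, 1), repeat=N))   # the 2**N points of the sample space

total = sum(bernoulli_joint_pmf(ks, p) for ks in outcomes)
print(len(outcomes), total)
```

The PMF depends on the outcome tuple only through the number of ones it contains, which is why the exponent in (9.8) is a sum.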

A joint cumulative distribution function (CDF) can be defined in the $N$-dimensional
case as

\[ F_{X_1,X_2,\ldots,X_N}(x_1,x_2,\ldots,x_N) = P[X_1 \le x_1, X_2 \le x_2, \ldots, X_N \le x_N]. \]

It has the usual properties of being between 0 and 1, being monotonically increasing
as any of the variables increases, and being “right continuous”. Also,

\[ F_{X_1,X_2,\ldots,X_N}(-\infty,-\infty,\ldots,-\infty) = 0 \]

\[ F_{X_1,X_2,\ldots,X_N}(+\infty,+\infty,\ldots,+\infty) = 1. \]

The marginal CDFs are easily found by letting the undesired variables be evaluated
at $+\infty$. For example, to determine the marginal CDF for $X_1$, we have

\[ F_{X_1}(x_1) = F_{X_1,X_2,\ldots,X_N}(x_1,+\infty,+\infty,\ldots,+\infty). \]

9.4  Transformations 

Since  X  is  an  N  x  1  random  vector,  a  transformation  or  mapping  to  a  random  vector 
Y  can  yield  another  N  x  1  random  vector  or  an  M  x  1  random  vector  with  M  <  N. 
In  the  former  case  the  formula  for  the  joint  PMF  of  Y  is  an  extension  of  the  usual 
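one (see (7.12)). If the transformation is given as $\mathbf{y} = \mathbf{g}(\mathbf{x})$, where $\mathbf{g}$ represents an
$N$-dimensional function or more explicitly

\begin{align*}
y_1 &= g_1(x_1,x_2,\ldots,x_N) \\
y_2 &= g_2(x_1,x_2,\ldots,x_N) \\
&\ \ \vdots \\
y_N &= g_N(x_1,x_2,\ldots,x_N)
\end{align*}

then

\[ p_{Y_1,Y_2,\ldots,Y_N}[y_1,y_2,\ldots,y_N] = \sum \cdots \sum_{\{(x_1,\ldots,x_N):\ g_1(x_1,\ldots,x_N) = y_1,\ \ldots,\ g_N(x_1,\ldots,x_N) = y_N\}} p_{X_1,X_2,\ldots,X_N}[x_1,x_2,\ldots,x_N]. \tag{9.9} \]

In the case where the transformation is one-to-one, there is only one solution for
$\mathbf{x}$ in the equation $\mathbf{y} = \mathbf{g}(\mathbf{x})$, which we denote symbolically by $\mathbf{x} = \mathbf{g}^{-1}(\mathbf{y})$. The
transformed joint PMF becomes from (9.9) $p_{\mathbf{Y}}[\mathbf{y}] = p_{\mathbf{X}}[\mathbf{g}^{-1}(\mathbf{y})]$, using vector notation.
A simple example of this is when the transformation is linear and so can
be represented by $\mathbf{y} = \mathbf{A}\mathbf{x}$, where $\mathbf{A}$ is an $N \times N$ nonsingular matrix. Then, the
solution is $\mathbf{x} = \mathbf{A}^{-1}\mathbf{y}$ and the transformed joint PMF becomes

\[ p_{\mathbf{Y}}[\mathbf{y}] = p_{\mathbf{X}}[\mathbf{A}^{-1}\mathbf{y}]. \tag{9.10} \]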

The other case, in which $\mathbf{Y}$ has dimension less than $N$, can be solved using the
technique of auxiliary random variables. We add enough random variables to make
the dimension of the transformed random vector equal to $N$, find the joint PMF via
(9.9), and finally sum the $N$-dimensional PMF over the auxiliary random variables.
More specifically, if $\mathbf{Y}$ is $M \times 1$ with $M < N$, we define a new $N \times 1$ random vector

\[ \mathbf{Z} = [\,Y_1\ \ Y_2\ \ldots\ Y_M\ \ Z_{M+1} = X_{M+1}\ \ Z_{M+2} = X_{M+2}\ \ldots\ Z_N = X_N\,]^T \]

so that the transformation becomes one-to-one, if possible. Once the joint PMF of
$\mathbf{Z}$ is found, we can determine the joint PMF of $\mathbf{Y}$ as

\[ p_{Y_1,Y_2,\ldots,Y_M}[y_1,y_2,\ldots,y_M] = \sum_{z_{M+1}}\sum_{z_{M+2}}\cdots\sum_{z_N} p_{Z_1,Z_2,\ldots,Z_N}[z_1,z_2,\ldots,z_N]. \]

The  determination  of  the  PMF  of  a  transformed  random  vector  is  in  general  not  an 
easy  task.  Even  to  determine  the  possible  values  of  Y  can  be  quite  difficult.  An 
example  follows  that  illustrates  the  work  involved. 

Example 9.2 — PMF for one-to-one transformation of N-dimensional
random vector

In  Example  9.1  X  has  the  joint  PMF  given  by  (9.8).  We  define  a  transformed 
random  vector  as 


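\begin{align*}
Y_1 &= X_1 \\
Y_2 &= X_1 + X_2 \\
Y_3 &= X_1 + X_2 + X_3.
\end{align*}

This is a linear transformation that maps a $3 \times 1$ random vector $\mathbf{X}$ into another
$3 \times 1$ random vector $\mathbf{Y}$. It can be represented by the $3 \times 3$ matrix

\[ \mathbf{A} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}. \]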


Note  that  the  transformed  random  variables  are  the  sums  of  the  outcomes  of  the  first 
Bernoulli  trial,  the  first  and  second  Bernoulli  trials,  and  finally  the  sum  of  the  first 
three  Bernoulli  trials.  As  such  the  values  of  the  transformed  random  variables  must 
take on certain values. In particular, $y_1 \le y_2 \le y_3$, or the outcomes cannot decrease
as the index $i$ increases. This is sometimes called a counting process and will be
studied in more detail when we discuss random processes. Some typical realizations
of the random vector $\mathbf{Y}$ are shown in Figure 9.1. To determine the sample space
for $\mathbf{Y}$ we enumerate the possible values, making sure that the values in the vector
increase or stay the same and that the increase is at most one unit from $y_i$ to $y_{i+1}$.
The sample space is composed of integer 3-tuples $(l_1, l_2, l_3)$, which is given by

\[ S_{Y_1,Y_2,Y_3} = \{(0,0,0),\ (0,0,1),\ (0,1,1),\ (1,1,1),\ (0,1,2),\ (1,1,2),\ (1,2,2),\ (1,2,3)\}. \tag{9.11} \]

Figure 9.1: Typical realizations for sum of outcomes of independent Bernoulli trials (panels (a), (b), (c)).

These are the values of $\mathbf{y}$ for which $p_{Y_1,Y_2,Y_3}$ is nonzero and are seen to be integer-valued.
Next, we need to solve for $\mathbf{x}$ according to (9.10). It is easily shown that the
linear transformation is one-to-one since $\mathbf{A}$ has an inverse (note that the determinant
of $\mathbf{A}$ is nonzero since $\det(\mathbf{A}) = 1$, and so $\mathbf{A}$ has an inverse), which is

\[ \mathbf{A}^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix}. \]

This says that $\mathbf{x} = \mathbf{A}^{-1}\mathbf{y}$ or $x_1 = y_1$, $x_2 = y_2 - y_1$, $x_3 = y_3 - y_2$. Thus, we can use
(9.10) and then (9.8) to find the joint PMF of $\mathbf{Y}$, which becomes from (9.10)

\[ p_{Y_1,Y_2,Y_3}[l_1,l_2,l_3] = p_{X_1,X_2,X_3}[l_1,\ l_2 - l_1,\ l_3 - l_2] \]


and since from (9.8)

\[ p_{X_1,X_2,X_3}[k_1,k_2,k_3] = p^{k_1+k_2+k_3}(1-p)^{3-(k_1+k_2+k_3)} \]

we have that

\[ p_{Y_1,Y_2,Y_3}[l_1,l_2,l_3] = p^{l_3}(1-p)^{3-l_3}. \tag{9.12} \]

Note that the joint PMF is nonzero only over the sample space $S_{Y_1,Y_2,Y_3}$ given in
(9.11).

❖ 


Always  make  sure  PMF  values  sum  to  one. 


The result of the previous example looks strange in that the joint PMF of $\mathbf{Y}$ does
not depend on $l_1$ and $l_2$. A simple check that should always be made when working
these types of problems is to verify that the PMF values sum to one. If not, then
there is an error in the calculation. If they do sum to one, then there could still
be an error but it is not likely. For the previous example, we have from (9.11) 1
outcome for which $l_3 = 0$, 3 outcomes for which $l_3 = 1$, 3 outcomes for which $l_3 = 2$,
and 1 outcome for which $l_3 = 3$. If we sum the probabilities of these outcomes we
have from (9.12)

\[ 1(1-p)^3 + 3p(1-p)^2 + 3p^2(1-p) + p^3 = 1 \]


and  hence  we  can  assert  with  some  confidence  that  the  result  is  correct. 
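The same check can be carried out by brute force. The sketch below maps every outcome of three Bernoulli trials through the running-sum transformation and accumulates the PMF of $\mathbf{Y}$; the value $p = 0.3$ and the variable names are assumptions for illustration only.

```python
from itertools import accumulate, product
from math import isclose

p = 0.3                                   # illustrative probability of a head
pmf_y = {}
for x in product((0, 1), repeat=3):
    y = tuple(accumulate(x))              # y1 = x1, y2 = x1 + x2, y3 = x1 + x2 + x3
    px = p ** sum(x) * (1 - p) ** (3 - sum(x))   # joint PMF (9.8) with N = 3
    pmf_y[y] = pmf_y.get(y, 0.0) + px

total = sum(pmf_y.values())
print(sorted(pmf_y))
print(total)
```

Because the transformation is one-to-one, each of the eight points of the sample space of $\mathbf{X}$ lands on a distinct point of (9.11), and each probability depends only on the last coordinate $l_3$.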


A  transformation  that  is  not  one-to-one  but  that  frequently  is  of  interest  is  the 
sum  of  N  independent  discrete  random  variables.  It  is  given  by 


\[ Y = \sum_{i=1}^{N} X_i \tag{9.13} \]


where the $X_i$'s are independent random variables with integer values. For the case
of $N = 2$ and integer-valued discrete random variables we saw in Section 7.6 that
$p_Y = p_{X_1} \star p_{X_2}$, where $\star$ denotes discrete convolution. This is most easily evaluated
using the characteristic functions and the inverse Fourier transform to yield

\[ p_Y[k] = \int_{-\pi}^{\pi} \phi_{X_1}(\omega)\,\phi_{X_2}(\omega)\exp(-j\omega k)\,\frac{d\omega}{2\pi}. \]

For a sum of $N$ independent random variables we have the similar result

\[ p_Y[k] = \int_{-\pi}^{\pi} \prod_{i=1}^{N} \phi_{X_i}(\omega)\exp(-j\omega k)\,\frac{d\omega}{2\pi} \]

and if all the $X_i$'s have the same PMF and hence the same characteristic function,
this becomes

\[ p_Y[k] = \int_{-\pi}^{\pi} \phi_X^N(\omega)\exp(-j\omega k)\,\frac{d\omega}{2\pi} \tag{9.14} \]

where $\phi_X(\omega)$ is the common characteristic function. An example follows (see also
Problem 9.9).

Example  9.3  —  Binomial  PMF  derived  as  PMF  of  sum  of  independent 
Bernoulli  random  variables 

We  had  previously  derived  the  binomial  PMF  by  examining  the  number  of  successes 
in  N  independent  Bernoulli  trials  (see  Section  4.6.2).  We  can  rederive  this  result  by 
using (9.14) with $X_i = 1$ for a success and $X_i = 0$ for a failure and determining the
PMF of $Y = \sum_{i=1}^{N} X_i$. The random variable $Y$ will be the number of successes in
$N$ trials. The characteristic function of $X$ for a single Bernoulli trial is

\begin{align*}
\phi_X(\omega) &= E_X[\exp(j\omega X)] \\
&= \exp(j\omega(1))\,p + \exp(j\omega(0))(1-p) \\
&= p\exp(j\omega) + (1-p).
\end{align*}


Now  using  (9.14)  we  have 


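\begin{align*}
p_Y[k] &= \int_{-\pi}^{\pi} [p\exp(j\omega) + (1-p)]^N \exp(-j\omega k)\,\frac{d\omega}{2\pi} \\
&= \int_{-\pi}^{\pi} \sum_{i=0}^{N} \binom{N}{i} [p\exp(j\omega)]^i (1-p)^{N-i} \exp(-j\omega k)\,\frac{d\omega}{2\pi} \qquad \text{(use binomial theorem)} \\
&= \sum_{i=0}^{N} \binom{N}{i} p^i (1-p)^{N-i} \int_{-\pi}^{\pi} \exp[j\omega(i-k)]\,\frac{d\omega}{2\pi}.
\end{align*}

But the integral can be shown to be 0 if $i \ne k$ and 1 if $i = k$ (see Problem 9.8).
Using this result, the only nonzero term in the sum is the one for
which $i = k$, and therefore

\[ p_Y[k] = \binom{N}{k} p^k (1-p)^{N-k} \qquad k = 0,1,\ldots,N. \]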

The  sum  of  N  independent  Bernoulli  random  variables  has  the  PMF  bin (N,p)  in 
accordance  with  our  earlier  results. 

❖ 
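The inversion integral (9.14) can also be evaluated numerically and compared with the binomial PMF. The sketch below replaces the integral by a uniform Riemann sum over $[-\pi, \pi)$; the function name `pmf_of_sum`, the grid size of 256 points, and the values $p = 0.3$, $N = 10$ are assumptions chosen for illustration.

```python
import cmath
from math import comb, isclose, pi

def pmf_of_sum(phi, N, k, num_points=256):
    """Approximate (9.14) with a uniform Riemann sum over [-pi, pi)."""
    step = 2 * pi / num_points
    total = 0j
    for m in range(num_points):
        w = -pi + m * step
        total += phi(w) ** N * cmath.exp(-1j * w * k)
    return (total * step / (2 * pi)).real

p, N = 0.3, 10                            # illustrative Bernoulli parameters

def phi_bernoulli(w):
    """Characteristic function of a single Bernoulli trial."""
    return p * cmath.exp(1j * w) + (1 - p)

pmf = [pmf_of_sum(phi_bernoulli, N, k) for k in range(N + 1)]
binomial = [comb(N, k) * p ** k * (1 - p) ** (N - k) for k in range(N + 1)]
print(max(abs(a - b) for a, b in zip(pmf, binomial)))
```

Since the integrand here is a trigonometric polynomial of low degree, the uniform sum reproduces the integral to within rounding error; for a general characteristic function the grid size would need more care.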


9.5  Expected  Values 

The  expected  value  of  a  random  vector  is  defined  as  the  vector  of  the  expected  values 
of  the  elements  of  the  random  vector.  This  is  to  say  that  we  define 


\[ E_{\mathbf{X}}[\mathbf{X}] = E_{X_1,X_2,\ldots,X_N}\left[ \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_N \end{bmatrix} \right] = \begin{bmatrix} E_{X_1}[X_1] \\ E_{X_2}[X_2] \\ \vdots \\ E_{X_N}[X_N] \end{bmatrix}. \tag{9.15} \]

We can view this definition as “passing” the expectation “through” the left bracket
of the vector since $E_{X_1,X_2,\ldots,X_N}[X_i] = E_{X_i}[X_i]$.

A particular expectation of interest is that of a scalar function of $X_1, X_2, \ldots, X_N$,
say $g(X_1, X_2, \ldots, X_N)$. Similar to previous results (see Section 7.7) this is determined
by using

\[ E_{X_1,X_2,\ldots,X_N}[g(X_1,X_2,\ldots,X_N)] = \sum_{x_1}\sum_{x_2}\cdots\sum_{x_N} g(x_1,x_2,\ldots,x_N)\, p_{X_1,X_2,\ldots,X_N}[x_1,x_2,\ldots,x_N]. \tag{9.16} \]


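As an example, if $g(X_1, X_2, \ldots, X_N) = \sum_{i=1}^{N} X_i$, then

\begin{align*}
E_{X_1,X_2,\ldots,X_N}\left[\sum_{i=1}^{N} X_i\right]
&= \sum_{x_1}\sum_{x_2}\cdots\sum_{x_N} (x_1 + x_2 + \cdots + x_N)\, p_{X_1,X_2,\ldots,X_N}[x_1,x_2,\ldots,x_N] \\
&= \sum_{x_1}\sum_{x_2}\cdots\sum_{x_N} x_1\, p_{X_1,X_2,\ldots,X_N}[x_1,x_2,\ldots,x_N] \\
&\quad + \sum_{x_1}\sum_{x_2}\cdots\sum_{x_N} x_2\, p_{X_1,X_2,\ldots,X_N}[x_1,x_2,\ldots,x_N] \\
&\quad + \cdots + \sum_{x_1}\sum_{x_2}\cdots\sum_{x_N} x_N\, p_{X_1,X_2,\ldots,X_N}[x_1,x_2,\ldots,x_N] \\
&= E_{X_1}[X_1] + E_{X_2}[X_2] + \cdots + E_{X_N}[X_N].
\end{align*}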


By  a  slight  modification  we  can  also  show  that 


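\[ E_{X_1,X_2,\ldots,X_N}\left[\sum_{i=1}^{N} a_i X_i\right] = \sum_{i=1}^{N} a_i E_{X_i}[X_i] \tag{9.17} \]

which says that the expectation is a linear operator. It is also possible to write
(9.17) more succinctly by defining the $N \times 1$ vector $\mathbf{a} = [a_1\ a_2\ \ldots\ a_N]^T$ to yield

\[ E_{\mathbf{X}}[\mathbf{a}^T\mathbf{X}] = \mathbf{a}^T E_{\mathbf{X}}[\mathbf{X}]. \tag{9.18} \]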


We  next  determine  the  variance  of  a  sum  of  random  variables.  Previously  it  was 
shown  that 

\[ \mathrm{var}(X_1 + X_2) = \mathrm{var}(X_1) + \mathrm{var}(X_2) + 2\,\mathrm{cov}(X_1,X_2). \tag{9.19} \]

Our goal is to extend this to $\mathrm{var}\left(\sum_{i=1}^{N} X_i\right)$ for any $N$. To do so we proceed as
follows. 


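\begin{align*}
\mathrm{var}\left(\sum_{i=1}^{N} X_i\right)
&= E_{\mathbf{X}}\left[\left(\sum_{i=1}^{N} X_i - E_{\mathbf{X}}\left[\sum_{i=1}^{N} X_i\right]\right)^2\right] \\
&= E_{\mathbf{X}}\left[\left(\sum_{i=1}^{N} (X_i - E_{X_i}[X_i])\right)^2\right] \qquad \text{(since } E_{\mathbf{X}}[X_i] = E_{X_i}[X_i]\text{)}
\end{align*}

and by letting $U_i = X_i - E_{X_i}[X_i]$ we have

\begin{align*}
\mathrm{var}\left(\sum_{i=1}^{N} X_i\right)
&= E_{\mathbf{X}}\left[\left(\sum_{i=1}^{N} U_i\right)^2\right] \\
&= E_{\mathbf{X}}\left[\sum_{i=1}^{N}\sum_{j=1}^{N} U_i U_j\right] \\
&= \sum_{i=1}^{N}\sum_{j=1}^{N} E_{\mathbf{X}}[U_i U_j].
\end{align*}

But

\begin{align*}
E_{\mathbf{X}}[U_i U_j] &= E_{X_i,X_j}[(X_i - E_{X_i}[X_i])(X_j - E_{X_j}[X_j])] \\
&= \mathrm{cov}(X_i,X_j)
\end{align*}

so that we have as our final result

\[ \mathrm{var}\left(\sum_{i=1}^{N} X_i\right) = \sum_{i=1}^{N}\sum_{j=1}^{N} \mathrm{cov}(X_i,X_j). \tag{9.20} \]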

Noting that $\mathrm{cov}(X_i,X_i) = \mathrm{var}(X_i)$ and $\mathrm{cov}(X_j,X_i) = \mathrm{cov}(X_i,X_j)$, we have
for $N = 2$ our previous result (9.19). Also, we can write (9.20) in the alternative
form

\[ \mathrm{var}\left(\sum_{i=1}^{N} X_i\right) = \sum_{i=1}^{N} \mathrm{var}(X_i) + \sum_{i=1}^{N}\sum_{\substack{j=1 \\ j \ne i}}^{N} \mathrm{cov}(X_i,X_j). \tag{9.21} \]

As an immediate and important consequence, we see that if all the random variables
are uncorrelated so that $\mathrm{cov}(X_i,X_j) = 0$ for $i \ne j$, then

\[ \mathrm{var}\left(\sum_{i=1}^{N} X_i\right) = \sum_{i=1}^{N} \mathrm{var}(X_i) \tag{9.22} \]

which  says  that  the  variance  of  a  sum  of  uncorrelated  random  variables  is  the  sum 
of  the  variances. 
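The double-sum formula (9.20) is easy to confirm on a small joint PMF. The sketch below uses a hypothetical three-variable PMF (the probabilities are made up for illustration) and compares the variance of the sum, computed directly from its definition, with the sum of all covariances; the helper names `expect` and `cov` are likewise assumptions.

```python
from math import isclose

def expect(joint, g):
    """Expected value of g(x) under a joint PMF given as dict: tuple -> probability."""
    return sum(p * g(x) for x, p in joint.items())

def cov(joint, i, j):
    """Covariance of the i-th and j-th components."""
    mi = expect(joint, lambda x: x[i])
    mj = expect(joint, lambda x: x[j])
    return expect(joint, lambda x: (x[i] - mi) * (x[j] - mj))

# Hypothetical joint PMF of (X1, X2, X3); the components are correlated.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}
N = 3

mean_sum = expect(joint, lambda x: sum(x))
var_sum = expect(joint, lambda x: (sum(x) - mean_sum) ** 2)       # direct definition
double_sum = sum(cov(joint, i, j) for i in range(N) for j in range(N))  # (9.20)
sum_of_vars = sum(cov(joint, i, i) for i in range(N))

print(var_sum, double_sum, sum_of_vars)
```

For this PMF the sum of the variances alone differs from the variance of the sum, since the covariances are not zero.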

We  wish  to  explore  (9.20)  further  since  it  embodies  some  important  concepts 
that  we  have  not  yet  touched  upon.  For  clarity  let  N  =  2.  Then  (9.20)  becomes 

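\[ \mathrm{var}(X_1 + X_2) = \sum_{i=1}^{2}\sum_{j=1}^{2} \mathrm{cov}(X_i,X_j). \tag{9.23} \]

If we define a $2 \times 2$ matrix $\mathbf{C}_X$ as

\[ \mathbf{C}_X = \begin{bmatrix} \mathrm{var}(X_1) & \mathrm{cov}(X_1,X_2) \\ \mathrm{cov}(X_2,X_1) & \mathrm{var}(X_2) \end{bmatrix} \]

then we can rewrite (9.23) as

\[ \mathrm{var}(X_1 + X_2) = \begin{bmatrix} 1 & 1 \end{bmatrix} \mathbf{C}_X \begin{bmatrix} 1 \\ 1 \end{bmatrix} \tag{9.24} \]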


as  is  easily  verified.  The  matrix  Cx  is  called  the  covariance  matrix.  It  is  a  matrix 
with  the  variances  along  the  main  diagonal  and  the  covariances  off  the  main  diagonal. 
For  N  =  3  it  is  given  by 


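\[ \mathbf{C}_X = \begin{bmatrix} \mathrm{var}(X_1) & \mathrm{cov}(X_1,X_2) & \mathrm{cov}(X_1,X_3) \\ \mathrm{cov}(X_2,X_1) & \mathrm{var}(X_2) & \mathrm{cov}(X_2,X_3) \\ \mathrm{cov}(X_3,X_1) & \mathrm{cov}(X_3,X_2) & \mathrm{var}(X_3) \end{bmatrix} \]

and in general it becomes

\[ \mathbf{C}_X = \begin{bmatrix} \mathrm{var}(X_1) & \mathrm{cov}(X_1,X_2) & \cdots & \mathrm{cov}(X_1,X_N) \\ \mathrm{cov}(X_2,X_1) & \mathrm{var}(X_2) & \cdots & \mathrm{cov}(X_2,X_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(X_N,X_1) & \mathrm{cov}(X_N,X_2) & \cdots & \mathrm{var}(X_N) \end{bmatrix}. \tag{9.25} \]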


The  covariance  matrix  has  many  important  properties,  which  are  discussed  next. 

Property 9.1 — Covariance matrix is symmetric, i.e., $\mathbf{C}_X^T = \mathbf{C}_X$.
Proof:

\[ \mathrm{cov}(X_j,X_i) = \mathrm{cov}(X_i,X_j) \qquad \text{(Why?)} \]


Property  9.2  —  Covariance  matrix  is  positive  semidefinite. 

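Being positive semidefinite means that if $\mathbf{a}$ is the $N \times 1$ column vector $\mathbf{a} =
[a_1\ a_2\ \ldots\ a_N]^T$, then $\mathbf{a}^T\mathbf{C}_X\mathbf{a} \ge 0$ for all $\mathbf{a}$. Note that $\mathbf{a}^T\mathbf{C}_X\mathbf{a}$ is a scalar and is
referred to as a quadratic form (see Appendix C).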

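Proof: Consider the case of $N = 2$ since the extension is immediate. Let $U_i =
X_i - E_{X_i}[X_i]$, which is zero mean, and therefore we have

\begin{align*}
\mathrm{var}(a_1X_1 + a_2X_2)
&= \mathrm{var}(a_1U_1 + a_2U_2) \qquad \text{(since } a_1X_1 + a_2X_2 = a_1U_1 + a_2U_2 + c \text{ for } c \text{ a constant)} \\
&= E_{\mathbf{X}}[(a_1U_1 + a_2U_2)^2] \qquad (E_{\mathbf{X}}[U_1] = E_{\mathbf{X}}[U_2] = 0) \\
&= a_1^2 E_{\mathbf{X}}[U_1^2] + a_2^2 E_{\mathbf{X}}[U_2^2] + a_1a_2 E_{\mathbf{X}}[U_1U_2] + a_2a_1 E_{\mathbf{X}}[U_2U_1] \qquad \text{(linearity of } E_{\mathbf{X}}\text{)} \\
&= a_1^2\,\mathrm{var}(X_1) + a_2^2\,\mathrm{var}(X_2) + a_1a_2\,\mathrm{cov}(X_1,X_2) + a_2a_1\,\mathrm{cov}(X_2,X_1) \\
&= \begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} \mathrm{var}(X_1) & \mathrm{cov}(X_1,X_2) \\ \mathrm{cov}(X_2,X_1) & \mathrm{var}(X_2) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \\
&= \mathbf{a}^T\mathbf{C}_X\mathbf{a}.
\end{align*}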


Since $\mathrm{var}(a_1X_1 + a_2X_2) \ge 0$ for all $a_1$ and $a_2$, it follows that $\mathbf{C}_X$ is positive semidefinite.

□ 

Also, note that the covariance matrix of random variables that are not perfectly
predictable by a linear predictor is positive definite. A positive definite covariance
matrix is one for which $\mathbf{a}^T\mathbf{C}_X\mathbf{a} > 0$ for all $\mathbf{a} \ne \mathbf{0}$. If, however, perfect prediction
is possible, as would be the case if for $N = 2$ we had $a_1X_1 + a_2X_2 + c = 0$, for $c$
a constant and for some $a_1$ and $a_2$, or equivalently if $X_2 = -(a_1/a_2)X_1 - (c/a_2)$,
then the covariance matrix is only positive semidefinite. This is because $\mathrm{var}(a_1X_1 +
a_2X_2) = \mathbf{a}^T\mathbf{C}_X\mathbf{a} = 0$ in this case.
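The distinction between positive definite and merely positive semidefinite can be seen by evaluating the quadratic form directly. In the sketch below both covariance matrices are hypothetical: the first has no perfect linear relationship between its variables, while the second assumes $X_2 = -2X_1$ with $\mathrm{var}(X_1) = 2$, so that $2X_1 + X_2 = 0$.

```python
def quadratic_form(a, C):
    """Evaluate a^T C a for a list a and a square matrix C given as nested lists."""
    n = len(a)
    return sum(a[i] * C[i][j] * a[j] for i in range(n) for j in range(n))

# Hypothetical covariance matrix with no perfect linear relationship.
C_pd = [[26.0, 6.0],
        [6.0, 26.0]]

# Covariance matrix when X2 = -2 X1 and var(X1) = 2:
# var(X2) = 4 var(X1) = 8 and cov(X1, X2) = -2 var(X1) = -4.
C_psd = [[2.0, -4.0],
         [-4.0, 8.0]]

q_pd = quadratic_form([1.0, -1.0], C_pd)    # variance of X1 - X2, strictly positive
q_psd = quadratic_form([2.0, 1.0], C_psd)   # variance of 2 X1 + X2, which is constant
print(q_pd, q_psd)
```

The second quadratic form vanishes for a nonzero vector, which is exactly the perfectly predictable case described above.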

Finally,  with  the  general  result  that  (see  Problem  9.14) 


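\[ \mathrm{var}\left(\sum_{i=1}^{N} a_i X_i\right) = \mathbf{a}^T\mathbf{C}_X\mathbf{a} \tag{9.26} \]

we have upon letting $\mathbf{a} = \mathbf{1} = [1\ 1\ \ldots\ 1]^T$ be an $N \times 1$ vector of ones that

\[ \mathrm{var}\left(\sum_{i=1}^{N} X_i\right) = \mathbf{1}^T\mathbf{C}_X\mathbf{1} \]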
which  is  another  way  of  writing  (9.20)  (the  effect  of  premultiplying  a  matrix  by  1T 
and  postmultiplying  by  1  is  to  sum  all  the  elements  in  the  matrix). 

The  fact  that  the  covariance  matrix  is  a  symmetric  positive  semidefinite  matrix 
is  important  in  that  it  must  exhibit  all  the  properties  of  that  type  of  matrix.  For 
example,  if  a  matrix  is  symmetric  positive  semidefinite,  then  it  can  be  shown  that 
its  determinant  is  nonnegative.  As  a  result,  it  follows  that  the  correlation  coefficient 
must  have  a  magnitude  less  than  or  equal  to  one  (see  Problem  9.18).  Some  other 
properties  of  a  covariance  matrix  follow. 

Property  9.3  —  Covariance  matrix  for  uncorrelated  random  variables  is 
a  diagonal  matrix. 

Note  that  a  diagonal  matrix  is  one  for  which  all  the  off-diagonal  elements  are  zero. 
Proof: Let $\mathrm{cov}(X_i,X_j) = 0$ for $i \ne j$ in (9.25).

□

Before listing the next property a new definition is needed. Similar to the definition
that the expected value of a random vector is the vector of expected values of the
elements, we define the expectation of a random matrix as the matrix of expected
values of its elements. As an example, if $N = 2$ the definition is

\[ E_{\mathbf{X}}\left[ \begin{bmatrix} g_{11}(\mathbf{X}) & g_{12}(\mathbf{X}) \\ g_{21}(\mathbf{X}) & g_{22}(\mathbf{X}) \end{bmatrix} \right] = \begin{bmatrix} E_{\mathbf{X}}[g_{11}(\mathbf{X})] & E_{\mathbf{X}}[g_{12}(\mathbf{X})] \\ E_{\mathbf{X}}[g_{21}(\mathbf{X})] & E_{\mathbf{X}}[g_{22}(\mathbf{X})] \end{bmatrix}. \]


Property  9.4  —  Covariance  matrix  of  Y  =  AX,  where  A  is  an  M  x  N 
matrix  (with  M  <  N),  is  easily  determined. 

The  covariance  matrix  of  Y  is 


\[ \mathbf{C}_Y = \mathbf{A}\mathbf{C}_X\mathbf{A}^T. \tag{9.27} \]


Proof: 

To  prove  this  result  without  having  to  explicitly  write  out  each  element  of  the  various 
matrices  requires  the  use  of  matrix  algebra.  We  therefore  only  sketch  the  proof  and 
leave  some  details  to  the  problems.  The  covariance  matrix  of  Y  can  alternatively 
be  defined  by  (see  Problem  9.21) 


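\[ \mathbf{C}_Y = E_{\mathbf{Y}}\left[(\mathbf{Y} - E_{\mathbf{Y}}[\mathbf{Y}])(\mathbf{Y} - E_{\mathbf{Y}}[\mathbf{Y}])^T\right]. \]

Therefore,

\begin{align*}
\mathbf{C}_Y &= E_{\mathbf{X}}\left[(\mathbf{A}\mathbf{X} - E_{\mathbf{X}}[\mathbf{A}\mathbf{X}])(\mathbf{A}\mathbf{X} - E_{\mathbf{X}}[\mathbf{A}\mathbf{X}])^T\right] \\
&= E_{\mathbf{X}}\left[\mathbf{A}(\mathbf{X} - E_{\mathbf{X}}[\mathbf{X}])\,(\mathbf{A}(\mathbf{X} - E_{\mathbf{X}}[\mathbf{X}]))^T\right] \qquad \text{(see Problem 9.22)} \\
&= \mathbf{A}\, E_{\mathbf{X}}\left[(\mathbf{X} - E_{\mathbf{X}}[\mathbf{X}])(\mathbf{X} - E_{\mathbf{X}}[\mathbf{X}])^T\right] \mathbf{A}^T \qquad \text{(see Problem 9.23)} \\
&= \mathbf{A}\mathbf{C}_X\mathbf{A}^T.
\end{align*}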

□ 

This result subsumes many of our previous ones (try $\mathbf{A} = \mathbf{1}^T = [1\ 1\ \ldots\ 1]$ and note
that $\mathbf{C}_Y = \mathrm{var}(Y)$ if $M = 1$, for example!).
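A small numerical sketch of (9.27) follows. The covariance matrix `C_X` and the $2 \times 3$ matrix `A` are hypothetical values chosen for illustration, and the helpers `matmul` and `transpose` are plain-Python stand-ins for a matrix library.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    """Transpose of a matrix stored as nested lists."""
    return [list(row) for row in zip(*A)]

# Hypothetical 3 x 3 covariance matrix (symmetric, positive definite).
C_X = [[4.0, 1.0, 0.5],
       [1.0, 3.0, -1.0],
       [0.5, -1.0, 2.0]]

# First row forms X1 + X2 + X3, second row forms X1 - X2.
A = [[1.0, 1.0, 1.0],
     [1.0, -1.0, 0.0]]

C_Y = matmul(matmul(A, C_X), transpose(A))   # covariance matrix of Y = A X, per (9.27)
print(C_Y)
```

The top-left entry of `C_Y` is the variance of the sum $X_1 + X_2 + X_3$, which equals the sum of all nine elements of `C_X`, in agreement with (9.20).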

Property  9.5  —  Covariance  matrix  can  always  be  diagonalized. 

The  importance  of  this  property  is  that  a  diagonalized  covariance  matrix  implies 
that  the  random  variables  are  uncorrelated.  Hence,  by  transforming  a  random 
vector  of  correlated  random  variable  elements  to  one  whose  covariance  matrix  is 
diagonal,  we  can  decorrelate  the  random  variables.  It  is  exceedingly  fortunate  that 
this  transformation  is  a  linear  one  and  is  easily  found.  In  summary,  if  X  has  a 
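covariance matrix $\mathbf{C}_X$, then we can find an $N \times N$ matrix $\mathbf{A}$ so that $\mathbf{Y} = \mathbf{A}\mathbf{X}$ has
the covariance matrix

\[ \mathbf{C}_Y = \begin{bmatrix} \mathrm{var}(Y_1) & 0 & \cdots & 0 \\ 0 & \mathrm{var}(Y_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathrm{var}(Y_N) \end{bmatrix}. \]

The matrix $\mathbf{A}$ is not unique (see Problem 7.35 for a particular method). One possible
determination of $\mathbf{A}$ is contained within the proof given next.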

Proof: 

We  only  sketch  the  proof  of  this  result  since  it  relies  heavily  on  linear  and  matrix 
algebra  (see  also  Appendix  C).  More  details  are  available  in  [Noble  and  Daniel 
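1977]. Since $\mathbf{C}_X$ is a symmetric matrix, it has a set of $N$ orthonormal eigenvectors
with corresponding real eigenvalues. Since $\mathbf{C}_X$ is also positive semidefinite, the
eigenvalues are nonnegative. Hence, we can find $N \times 1$ eigenvectors $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N\}$
so that

\[ \mathbf{C}_X\mathbf{v}_i = \lambda_i\mathbf{v}_i \qquad i = 1,2,\ldots,N \]

where $\mathbf{v}_i^T\mathbf{v}_j = 0$ for $i \ne j$ (orthogonality), $\mathbf{v}_i^T\mathbf{v}_i = 1$ (normalized to unit length),
and $\lambda_i \ge 0$. We can arrange the $N \times 1$ column vectors $\mathbf{C}_X\mathbf{v}_i$ and also $\lambda_i\mathbf{v}_i$ into
$N \times N$ matrices so that

\[ [\,\mathbf{C}_X\mathbf{v}_1\ \ \mathbf{C}_X\mathbf{v}_2\ \ldots\ \mathbf{C}_X\mathbf{v}_N\,] = [\,\lambda_1\mathbf{v}_1\ \ \lambda_2\mathbf{v}_2\ \ldots\ \lambda_N\mathbf{v}_N\,]. \tag{9.28} \]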


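But it may be shown that for an $N \times N$ matrix $\mathbf{A}$ and $N \times 1$ vectors $\mathbf{b}_1, \mathbf{b}_2, \mathbf{d}_1, \mathbf{d}_2$,
using $N = 2$ for simplicity (see Problem 9.24),

\[ [\,\mathbf{A}\mathbf{b}_1\ \ \mathbf{A}\mathbf{b}_2\,] = \mathbf{A}\,[\,\mathbf{b}_1\ \ \mathbf{b}_2\,] \tag{9.29} \]

\[ [\,c_1\mathbf{d}_1\ \ c_2\mathbf{d}_2\,] = [\,\mathbf{d}_1\ \ \mathbf{d}_2\,] \begin{bmatrix} c_1 & 0 \\ 0 & c_2 \end{bmatrix}. \tag{9.30} \]

Using these relationships (9.28) becomes

\[ \mathbf{C}_X \underbrace{[\,\mathbf{v}_1\ \ \mathbf{v}_2\ \ldots\ \mathbf{v}_N\,]}_{\mathbf{V}} = \underbrace{[\,\mathbf{v}_1\ \ \mathbf{v}_2\ \ldots\ \mathbf{v}_N\,]}_{\mathbf{V}} \underbrace{\begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_N \end{bmatrix}}_{\boldsymbol{\Lambda}} \]

or

\[ \mathbf{C}_X\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda}. \]

(The matrix $\mathbf{V}$ is known as the modal matrix and is invertible.) Premultiplying
both sides by $\mathbf{V}^{-1}$ produces

\[ \mathbf{V}^{-1}\mathbf{C}_X\mathbf{V} = \boldsymbol{\Lambda}. \]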


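Next we use the property that the eigenvectors are orthonormal to assert that $\mathbf{V}^{-1} =
\mathbf{V}^T$ (a property of orthogonal matrices), and therefore

\[ \mathbf{V}^T\mathbf{C}_X\mathbf{V} = \boldsymbol{\Lambda}. \tag{9.31} \]

Now recall from Property 9.4 that if $\mathbf{Y} = \mathbf{A}\mathbf{X}$, then $\mathbf{C}_Y = \mathbf{A}\mathbf{C}_X\mathbf{A}^T$. Thus, if we
let $\mathbf{Y} = \mathbf{A}\mathbf{X} = \mathbf{V}^T\mathbf{X}$, we will have

\begin{align*}
\mathbf{C}_Y &= \mathbf{V}^T\mathbf{C}_X\mathbf{V} \qquad \text{(from Property 9.4)} \\
&= \boldsymbol{\Lambda} \qquad \text{(from (9.31))}
\end{align*}

and the covariance matrix of $\mathbf{Y}$ will be diagonal with $i$th diagonal element $\mathrm{var}(Y_i) =
\lambda_i \ge 0$.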

□ 

This  important  result  is  used  extensively  in  many  disciplines.  Later  we  will  see  that 
for  some  types  of  continuous  random  vectors,  the  use  of  this  linear  transformation 
will  make  the  random  variables  not  only  uncorrelated  but  independent  as  well  (see 
Example  12.14).  An  example  follows. 

Example  9.4  —  Decorrelation  of  random  variables 

We consider a two-dimensional example whose joint PMF is given in Table 9.1. We
first determine the covariance matrix $\mathbf{C}_X$ and then $\mathbf{A}$ so that $\mathbf{Y} = \mathbf{A}\mathbf{X}$ consists of
uncorrelated random variables.

              x2 = -8    x2 = 0    x2 = 2    x2 = 6    p_X1[x1]
x1 = -8          0        1/4        0         0         1/4
x1 = 0          1/4        0         0         0         1/4
x1 = 2           0         0         0        1/4        1/4
x1 = 6           0         0        1/4        0         1/4
p_X2[x2]        1/4       1/4       1/4       1/4

Table 9.1: Joint PMF values.

From Table 9.1 we have that

\begin{align*}
E_{X_1}[X_1] &= E_{X_2}[X_2] = 0 \\
E_{X_1}[X_1^2] &= E_{X_2}[X_2^2] = 26 \\
E_{X_1,X_2}[X_1X_2] &= 6
\end{align*}

and therefore we have that

\begin{align*}
\mathrm{var}(X_1) &= \mathrm{var}(X_2) = 26 \\
\mathrm{cov}(X_1,X_2) &= 6
\end{align*}


yielding a covariance matrix

\[ \mathbf{C}_X = \begin{bmatrix} 26 & 6 \\ 6 & 26 \end{bmatrix}. \]

To find the eigenvectors we need to first find the eigenvalues and then solve $(\mathbf{C}_X -
\lambda\mathbf{I})\mathbf{v} = \mathbf{0}$ for each eigenvector $\mathbf{v}$. To determine the eigenvalues we need to solve for
$\lambda$ in the equation $\det(\mathbf{C}_X - \lambda\mathbf{I}) = 0$. This is

\[ \det\begin{bmatrix} 26 - \lambda & 6 \\ 6 & 26 - \lambda \end{bmatrix} = 0 \]

or

\[ (26 - \lambda)(26 - \lambda) - 36 = 0 \]

and has solutions $\lambda_1 = 20$ and $\lambda_2 = 32$. Then, solving for the corresponding
eigenvectors yields


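\[ (\mathbf{C}_X - \lambda_1\mathbf{I})\mathbf{v}_1 = \begin{bmatrix} 6 & 6 \\ 6 & 6 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]

which yields after normalizing the eigenvector to have unit length

\[ \mathbf{v}_1 = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix}. \]

Similarly,

\[ (\mathbf{C}_X - \lambda_2\mathbf{I})\mathbf{v}_2 = \begin{bmatrix} -6 & 6 \\ 6 & -6 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]

which yields after normalizing the eigenvector to have unit length

\[ \mathbf{v}_2 = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{bmatrix}. \]

The modal matrix becomes

\[ \mathbf{V} = [\,\mathbf{v}_1\ \ \mathbf{v}_2\,] = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix} \]

and therefore

\[ \mathbf{A} = \mathbf{V}^T = \begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix}. \]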

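Hence, the transformed random vector $\mathbf{Y} = \mathbf{A}\mathbf{X}$ is explicitly

\begin{align*}
Y_1 &= \frac{1}{\sqrt{2}}X_1 - \frac{1}{\sqrt{2}}X_2 \\
Y_2 &= \frac{1}{\sqrt{2}}X_1 + \frac{1}{\sqrt{2}}X_2
\end{align*}

and $Y_1$ and $Y_2$ are uncorrelated random variables with

\begin{align*}
E_{\mathbf{Y}}[\mathbf{Y}] &= E_{\mathbf{X}}[\mathbf{A}\mathbf{X}] = \mathbf{A}E_{\mathbf{X}}[\mathbf{X}] = \mathbf{0} \\
\mathbf{C}_Y &= \mathbf{A}\mathbf{C}_X\mathbf{A}^T = \mathbf{V}^T\mathbf{C}_X\mathbf{V} = \boldsymbol{\Lambda} = \begin{bmatrix} 20 & 0 \\ 0 & 32 \end{bmatrix}.
\end{align*}

It is interesting to note in this example, and in general, that $\mathbf{A}$ is a rotation matrix
or

\[ \mathbf{A} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \]

where $\theta = \pi/4$. The effect of multiplying a $2 \times 1$ vector by this matrix is to rotate
the vector 45° in the counterclockwise direction (see Problem 9.27). As seen in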
Figure  9.2  the  values  of  X,  indicated  by  the  small  circles  and  also  given  in  Table 
9.1,  become  the  values  of  Y,  indicated  by  the  large  circles.  One  can  easily  verify 
the  rotation. 
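The numbers in this example can be reproduced with a few lines of code. The sketch below stores the four nonzero entries of Table 9.1, computes the covariance matrix of $\mathbf{X}$, applies the matrix $\mathbf{A} = \mathbf{V}^T$ found above to each sample point, and recomputes the covariance matrix for $\mathbf{Y}$; the helper name `covariance_matrix` is an assumption.

```python
from math import isclose, sqrt

# Nonzero entries of the joint PMF in Table 9.1, keyed by (x1, x2).
joint = {(-8, 0): 0.25, (0, -8): 0.25, (2, 6): 0.25, (6, 2): 0.25}

def covariance_matrix(pmf):
    """Covariance matrix of a random vector with PMF given as dict: tuple -> probability."""
    n = len(next(iter(pmf)))
    mean = [sum(p * x[i] for x, p in pmf.items()) for i in range(n)]
    return [[sum(p * (x[i] - mean[i]) * (x[j] - mean[j]) for x, p in pmf.items())
             for j in range(n)] for i in range(n)]

C_X = covariance_matrix(joint)

r = 1 / sqrt(2)
A = [[r, -r],
     [r, r]]                       # A = V^T from the example

# Map each sample point x to y = A x, carrying its probability along.
pmf_y = {}
for x, p in joint.items():
    y = tuple(sum(A[i][j] * x[j] for j in range(2)) for i in range(2))
    pmf_y[y] = pmf_y.get(y, 0.0) + p

C_Y = covariance_matrix(pmf_y)
print(C_X)
print(C_Y)
```

Up to floating-point rounding, `C_Y` is diagonal with the eigenvalues 20 and 32 on its diagonal.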

Figure 9.2: Sample points for X (small circles) and Y (large circles). The dashed
lines indicate a 45° rotation.

9.6  Joint  Moments  and  the  Characteristic  Function


The joint moments corresponding to an $N$-dimensional PMF are defined as

\[ E_{X_1,X_2,\ldots,X_N}[X_1^{l_1}X_2^{l_2}\cdots X_N^{l_N}] = \sum_{x_1}\sum_{x_2}\cdots\sum_{x_N} x_1^{l_1}x_2^{l_2}\cdots x_N^{l_N}\, p_{X_1,X_2,\ldots,X_N}[x_1,x_2,\ldots,x_N]. \tag{9.32} \]

As  usual  if  the  random  variables  are  independent,  the  joint  PMF  factors  and  there¬ 
fore 


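\[ E_{X_1,X_2,\ldots,X_N}[X_1^{l_1}X_2^{l_2}\cdots X_N^{l_N}] = E_{X_1}[X_1^{l_1}]\,E_{X_2}[X_2^{l_2}]\cdots E_{X_N}[X_N^{l_N}]. \tag{9.33} \]

The joint characteristic function is defined as

\[ \phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N) = E_{X_1,X_2,\ldots,X_N}\left[\exp[j(\omega_1X_1 + \omega_2X_2 + \cdots + \omega_NX_N)]\right] \tag{9.34} \]

and is evaluated as

\[ \phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N) = \sum_{x_1}\sum_{x_2}\cdots\sum_{x_N} \exp[j(\omega_1x_1 + \omega_2x_2 + \cdots + \omega_Nx_N)]\, p_{X_1,X_2,\ldots,X_N}[x_1,x_2,\ldots,x_N]. \]

In particular, for independent random variables, we have (see Problem 9.28)

\[ \phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N) = \phi_{X_1}(\omega_1)\,\phi_{X_2}(\omega_2)\cdots\phi_{X_N}(\omega_N). \]

Also, if $\mathbf{X}$ takes on integer values, the joint PMF can be found from the joint
characteristic function using the inverse Fourier transform or

\[ p_{X_1,X_2,\ldots,X_N}[k_1,k_2,\ldots,k_N] = \int_{-\pi}^{\pi}\int_{-\pi}^{\pi}\cdots\int_{-\pi}^{\pi} \phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N) \exp[-j(\omega_1k_1 + \omega_2k_2 + \cdots + \omega_Nk_N)]\, \frac{d\omega_1}{2\pi}\frac{d\omega_2}{2\pi}\cdots\frac{d\omega_N}{2\pi}. \tag{9.35} \]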


All the properties of the 2-dimensional characteristic function extend to the general
case. Note that once $\phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N)$ is known, the characteristic function
for any subset of the $X_i$'s is found by setting $\omega_i$ equal to zero for the ones not
in the subset. For example, to find $p_{X_1,X_2}[k_1,k_2]$, we let $\omega_3 = \omega_4 = \cdots = \omega_N = 0$ in
the joint characteristic function to yield

\[ p_{X_1,X_2}[k_1,k_2] = \int_{-\pi}^{\pi}\int_{-\pi}^{\pi} \underbrace{\phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,0,0,\ldots,0)}_{\phi_{X_1,X_2}(\omega_1,\omega_2)} \exp[-j(\omega_1k_1 + \omega_2k_2)]\, \frac{d\omega_1}{2\pi}\frac{d\omega_2}{2\pi}. \]

As seen previously, the joint moments can be obtained from the characteristic function.
The general formula is

\[ E_{X_1,X_2,\ldots,X_N}[X_1^{l_1}X_2^{l_2}\cdots X_N^{l_N}] = \frac{1}{j^{\,l_1+l_2+\cdots+l_N}} \left. \frac{\partial^{\,l_1+l_2+\cdots+l_N}}{\partial\omega_1^{l_1}\,\partial\omega_2^{l_2}\cdots\partial\omega_N^{l_N}}\, \phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N) \right|_{\omega_1=\omega_2=\cdots=\omega_N=0}. \tag{9.36} \]

9.7  Conditional  Probability  Mass  Functions 


When we have an $N$-dimensional random vector, many different conditional PMFs
can be defined. A straightforward extension of the conditional PMF $p_{Y|X}$ encountered
in Chapter 8 is the conditional PMF of a single random variable conditioned
on knowledge of the outcomes of all the other random variables. For example, it is
of interest to study $p_{X_N|X_1,X_2,\ldots,X_{N-1}}$, whose definition is

\[ p_{X_N|X_1,X_2,\ldots,X_{N-1}}[x_N|x_1,x_2,\ldots,x_{N-1}] = \frac{p_{X_1,X_2,\ldots,X_N}[x_1,x_2,\ldots,x_N]}{p_{X_1,X_2,\ldots,X_{N-1}}[x_1,x_2,\ldots,x_{N-1}]}. \tag{9.37} \]

Then by rearranging (9.37) we have upon omitting the arguments

\[ p_{X_1,X_2,\ldots,X_N} = p_{X_N|X_1,X_2,\ldots,X_{N-1}}\, p_{X_1,X_2,\ldots,X_{N-1}}. \tag{9.38} \]

If we replace $N$ by $N - 1$ in (9.37), we have

\[ p_{X_{N-1}|X_1,X_2,\ldots,X_{N-2}} = \frac{p_{X_1,X_2,\ldots,X_{N-1}}}{p_{X_1,X_2,\ldots,X_{N-2}}} \]

or

\[ p_{X_1,X_2,\ldots,X_{N-1}} = p_{X_{N-1}|X_1,X_2,\ldots,X_{N-2}}\, p_{X_1,X_2,\ldots,X_{N-2}}. \]

Inserting this into (9.38) yields

\[ p_{X_1,X_2,\ldots,X_N} = p_{X_N|X_1,X_2,\ldots,X_{N-1}}\, p_{X_{N-1}|X_1,X_2,\ldots,X_{N-2}}\, p_{X_1,X_2,\ldots,X_{N-2}}. \]

Continuing this process results in the general chain rule for joint PMFs (see also
(4.10))

\[ p_{X_1,X_2,\ldots,X_N} = p_{X_N|X_1,X_2,\ldots,X_{N-1}}\, p_{X_{N-1}|X_1,X_2,\ldots,X_{N-2}} \cdots p_{X_2|X_1}\, p_{X_1}. \tag{9.39} \]

A particularly useful special case of this relationship occurs when the conditional
PMFs satisfy

\[ p_{X_n|X_1,X_2,\ldots,X_{n-1}} = p_{X_n|X_{n-1}} \qquad \text{for } n = 3,4,\ldots,N \tag{9.40} \]

or $X_n$ is independent of $X_1, \ldots, X_{n-2}$ if $X_{n-1}$ is known for all $n \ge 3$. If we view $n$
as a time index, then this says that the probability of the current random variable
$X_n$ is independent of the past outcomes once the most recent past outcome $X_{n-1}$
is known. This is called the Markov property, which was described in Section 4.6.4.
When the Markov property holds, we can rewrite (9.39) in the particularly simple
form

\[ p_{X_1,X_2,\ldots,X_N} = p_{X_N|X_{N-1}}\, p_{X_{N-1}|X_{N-2}} \cdots p_{X_2|X_1}\, p_{X_1} \tag{9.41} \]

which is a factorization of the $N$-dimensional joint PMF into a product of first-order
conditional PMFs. It can be considered as the logical extension of the factorization
of the $N$-dimensional joint PMF of independent random variables into the product
of its marginals. As such it enjoys many useful properties, which are discussed
in Chapter 22. A simple example of when (9.40) holds is for a “running” sum of
independent random variables or $X_n = \sum_{i=1}^{n} U_i$, where the $U_i$'s are independent.

Then, we have

\begin{align*}
X_1 &= U_1 \\
X_2 &= U_1 + U_2 = X_1 + U_2 \\
X_3 &= U_1 + U_2 + U_3 = X_2 + U_3 \\
&\ \ \vdots \\
X_N &= X_{N-1} + U_N.
\end{align*}


For example, if $X_2$ is known, the PMF of $X_3 = X_2 + U_3$ depends only on $U_3$ and not on
$X_1$. Also, it is seen from the definition of the random variables that $U_3$ and $U_1 = X_1$
are independent. Thus, once $X_2$ is known, $X_3$ (a function of $U_3$) is independent of
$X_1$ (a function of $U_1$). As a result, $p_{X_3|X_2,X_1} = p_{X_3|X_2}$ and in general

\[ p_{X_n|X_1,X_2,\ldots,X_{n-1}} = p_{X_n|X_{n-1}} \qquad \text{for } n = 3,4,\ldots,N \]

or  (9.40)  is  satisfied.  It  is  said  that  “the  PMF  of  Xn  given  the  past  samples  depends 
only  on  the  most  recent  past  sample”.  To  illustrate  this  we  consider  a  particular 
running  sum  of  independent  random  variables  known  as  a  random  walk. 

Example  9.5  —  Random  walk 

Let  Ui  for  i  =  1, 2, . . . ,  N  be  independent  random  variables  with  the  same  PMF 


pu[k]  = 


1  —p  k  =  — 1 

p  k  =  1 


Xn  =  Yi  Ui- 

i—l 


and  define 
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At  each  “time”  n  the  new  random  variable  Xn  changes  from  the  old  random  variable 
Xn_i  by  ±1  since  Xn  =  Xn-  i  +  Un.  The  joint  PMF  is  from  (9.41) 

N 

PXuX2,...,Xn  =  n  *  (»•«) 

72=1 

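where $p_{X_1|X_0}$ is defined as $p_{X_1}$. But $p_{X_n|X_{n-1}}$ can be found by noting that $X_n =
X_{n-1} + U_n$ and therefore if $X_{n-1} = x_{n-1}$ we have that

$$
\begin{aligned}
p_{X_n|X_{n-1}}[x_n|x_{n-1}] &= p_{U_n|X_{n-1}}[x_n - x_{n-1}|x_{n-1}] && \text{(step 1 - transform PMF)} \\
&= p_{U_n}[x_n - x_{n-1}] && \text{(step 2 - independence)} \\
&= p_U[x_n - x_{n-1}] && \text{($U_n$'s have same PMF).}
\end{aligned}
$$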


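Step 1 results from the transformed random variable $Y = X + c$, where $c$ is a constant,
having a PMF $p_Y[y_i] = p_X[y_i - c]$. Step 2 results from $U_n$ being independent
of $X_{n-1} = \sum_{i=1}^{n-1} U_i$ since all the $U_i$'s are independent. Finally, we have from (9.42)

$$p_{X_1,X_2,\ldots,X_N}[x_1, x_2, \ldots, x_N] = \prod_{n=1}^{N} p_U[x_n - x_{n-1}] \tag{9.43}$$

where $x_0$ is taken to be 0, consistent with defining $p_{X_1|X_0}$ as $p_{X_1}$.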

A realization of the random variables for $p = 1/2$ is shown in Figure 9.3. As justified
by the character of the outcomes in Figure 9.3b, this random process is termed a
random walk. We will say more about this later in Chapter 16.

[Figure 9.3: Typical realization of a random walk. Panel (b): Realization of $X_n$'s.]

Note that the probability of the realization in Figure 9.3b is from (9.43)

$$p_{X_1,\ldots,X_{30}}[x_1, \ldots, x_{30}] = \prod_{n=1}^{30} p_U[x_n - x_{n-1}] = \prod_{n=1}^{30} \frac{1}{2} = \left(\frac{1}{2}\right)^{30}$$

since $p_U[-1] = p_U[1] = 1/2$.
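
The random walk of Example 9.5 is easy to simulate, and (9.43) can be evaluated directly for the simulated path. The following MATLAB lines are a minimal sketch and not the program used to produce Figure 9.3; the variable names (`steps`, `pu`, `prob`) and the way the ±1 steps are drawn from `rand` are illustrative assumptions.

```matlab
% Sketch: one realization of the random walk and its probability from (9.43)
N = 30;                          % number of steps, as in Figure 9.3
p = 0.5;                         % P[U = +1]
u = 2*(rand(N,1) < p) - 1;       % U_n = +1 with probability p, -1 otherwise
x = cumsum(u);                   % X_n = U_1 + ... + U_n
steps = diff([0; x]);            % x_n - x_{n-1}, taking x_0 = 0
pu = p*(steps == 1) + (1-p)*(steps == -1);   % p_U evaluated at each step
prob = prod(pu)                  % joint probability of this particular path
```

For $p = 1/2$ every path of length 30 has the same probability $(1/2)^{30}$, whatever `rand` returns. For other values of `p` the result depends on how many $+1$ steps occurred.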


9.8  Computer  Simulation  of  Random  Vectors 


To  generate  a  realization  of  a  random  vector  we  can  use  the  direct  method  described 
in  Section  7.11  or  the  conditional  approach  of  Section  8.7.  The  latter  uses  the  general 
chain  rule  (see  (9.39)).  We  will  not  pursue  this  further  as  the  extension  to  an  N  x  1 
random  vector  is  obvious.  Instead  we  concentrate  on  two  important  descriptors  of 
a  random  vector,  those  being  the  mean  vector  given  by  (9.15)  and  the  covariance 
matrix  given  by  (9.25).  We  wish  to  see  how  to  estimate  these  quantities.  In  practice, 
the $N$-dimensional PMF is usually quite difficult to estimate and so we settle for
the  estimation  of  the  means  and  covariances.  The  mean  vector  is  easily  estimated 
by  estimating  each  element  by  its  sample  mean  as  we  have  done  in  Section  6.8.  Here 
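we assume that we have $M$ realizations of the $N \times 1$ random vector $\mathbf{X}$, which we denote
as $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M\}$. The mean vector estimate becomes

$$\widehat{E_X[\mathbf{X}]} = \frac{1}{M} \sum_{m=1}^{M} \mathbf{x}_m \tag{9.44}$$

which is the same as estimating the $i$th component of $E_X[\mathbf{X}]$ by $(1/M) \sum_{m=1}^{M} [\mathbf{x}_m]_i$,
where $[\boldsymbol{\xi}]_i$ denotes the $i$th component of the vector $\boldsymbol{\xi}$. To estimate the $N \times N$
covariance matrix we first recall that the vector/matrix definition is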


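$$\mathbf{C}_X = E_X\left[(\mathbf{X} - E_X[\mathbf{X}])(\mathbf{X} - E_X[\mathbf{X}])^T\right].$$

This can also be shown to be equivalent to (see Problem 9.31)

$$\mathbf{C}_X = E_X[\mathbf{X}\mathbf{X}^T] - (E_X[\mathbf{X}])(E_X[\mathbf{X}])^T. \tag{9.45}$$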

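We can now replace $E_X[\mathbf{X}]$ by the estimate of (9.44). To estimate the $N \times N$ matrix
$E_X[\mathbf{X}\mathbf{X}^T]$ we replace it by $(1/M) \sum_{m=1}^{M} \mathbf{x}_m \mathbf{x}_m^T$ since it is easily shown that the $(i,j)$ element
of $E_X[\mathbf{X}\mathbf{X}^T]$ is

$$\left[E_X[\mathbf{X}\mathbf{X}^T]\right]_{ij} = E_X[X_i X_j] = E_{X_i,X_j}[X_i X_j]$$

and

$$\left[\frac{1}{M} \sum_{m=1}^{M} \mathbf{x}_m \mathbf{x}_m^T\right]_{ij} = \frac{1}{M} \sum_{m=1}^{M} [\mathbf{x}_m]_i [\mathbf{x}_m]_j.$$

Thus we have that

$$\widehat{\mathbf{C}_X} = \frac{1}{M} \sum_{m=1}^{M} \mathbf{x}_m \mathbf{x}_m^T - \left(\widehat{E_X[\mathbf{X}]}\right)\left(\widehat{E_X[\mathbf{X}]}\right)^T$$

which can also be written as

$$\widehat{\mathbf{C}_X} = \frac{1}{M} \sum_{m=1}^{M} \left(\mathbf{x}_m - \widehat{E_X[\mathbf{X}]}\right)\left(\mathbf{x}_m - \widehat{E_X[\mathbf{X}]}\right)^T \tag{9.46}$$

where $\widehat{E_X[\mathbf{X}]}$ is given by (9.44). The latter form of the covariance matrix estimate
is also more easily implemented. An example follows.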
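
The two forms of the covariance matrix estimate can be compared numerically. The lines below are a sketch under assumed test data (independent integer-valued components drawn with `rand`); the names `C1`, `C2`, and `maxdiff` are illustrative and do not come from the text.

```matlab
% Sketch: the two forms of the covariance matrix estimate give the same result
N = 3; M = 500;
x = floor(4*rand(N,M));              % hypothetical data: entries in {0,1,2,3}
meanx = sum(x,2)/M;                  % mean vector estimate, (9.44)
C1 = (x*x')/M - meanx*meanx';        % sample version of (9.45)
xbar = x - meanx*ones(1,M);          % subtract the mean estimate from each realization
C2 = (xbar*xbar')/M;                 % form (9.46)
maxdiff = max(max(abs(C1 - C2)))     % the two forms differ only by rounding error
```

Both forms need the mean estimate first. The second works with mean-removed data and is the one implemented, with explicit loops, in the program of Example 9.6.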

Example  9.6  —  Decorrelation  of  random  variables  —  continued 

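In Example 9.4 we showed that we could decorrelate the random variable components
of a random vector by applying the appropriate linear transformation to the
random vector. In particular, if the $2 \times 1$ random vector $\mathbf{X}$ whose joint PMF is
given in Table 9.1 is transformed to a random vector $\mathbf{Y}$, where

$$\mathbf{Y} = \begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix} \mathbf{X}$$

then the covariance matrix for $\mathbf{X}$

$$\mathbf{C}_X = \begin{bmatrix} 26 & 6 \\ 6 & 26 \end{bmatrix}$$

becomes the diagonal covariance matrix for $\mathbf{Y}$

$$\mathbf{C}_Y = \begin{bmatrix} 20 & 0 \\ 0 & 32 \end{bmatrix}.$$

To check this we generate realizations of $\mathbf{X}$, as explained in Section 7.11, and then use
the estimate of the covariance matrix given by (9.46). The results are for $M = 1000$
realizations

$$\widehat{\mathbf{C}_X} = \begin{bmatrix} 25.9080 & 6.1077 \\ 6.1077 & 25.8558 \end{bmatrix} \qquad \widehat{\mathbf{C}_Y} = \begin{bmatrix} 19.7742 & 0.0261 \\ 0.0261 & 31.9896 \end{bmatrix}$$

and are near to the true covariance matrices. The entire MATLAB program is given
next.
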
% covexample.m

clear all % clears out all previous variables from workspace
rand('state',0); % sets random number generator to initial value
M=1000;
for m=1:M % generate realizations of X (see Section 7.11)
   u=rand(1,1);
   if u<=0.25
      x(1,m)=-8;x(2,m)=0;
   elseif u>0.25&u<=0.5
      x(1,m)=0;x(2,m)=-8;
   elseif u>0.5&u<=0.75
      x(1,m)=2;x(2,m)=6;
   else
      x(1,m)=6;x(2,m)=2;
   end
end
meanx=[0 0]'; % estimate mean vector of X
for m=1:M
   meanx=meanx+x(:,m)/M;
end
meanx
CX=zeros(2,2);
for m=1:M % estimate covariance matrix of X
   xbar(:,m)=x(:,m)-meanx;
   CX=CX+xbar(:,m)*xbar(:,m)'/M;
end
CX
A=[1/sqrt(2) -1/sqrt(2);1/sqrt(2) 1/sqrt(2)];
for m=1:M % transform random vector X
   y(:,m)=A*x(:,m);
end
meany=[0 0]'; % estimate mean vector of Y
for m=1:M
   meany=meany+y(:,m)/M;
end
meany
CY=zeros(2,2);
for m=1:M % estimate covariance matrix of Y
   ybar(:,m)=y(:,m)-meany;
   CY=CY+ybar(:,m)*ybar(:,m)'/M;
end
CY

9.9  Real-World  Example  —  Image  Coding


Methods for the digital storage and transmission of images are an important consideration
in the modern digital age. One of the standard procedures used to convert
an  image  to  its  digital  representation  is  the  JPEG  encoding  format  [Sayood  1996]. 
It  makes  the  observation  that  many  images  contain  portions  that  do  not  change 
significantly  in  content.  Such  would  be  the  case  for  the  image  of  a  house  in  which 
the  color  and  texture  of  the  siding,  whether  it  be  aluminum  siding  or  clapboards, 
is  relatively  constant  as  the  image  is  scanned  in  the  horizontal  direction.  To  store 
and  transmit  all  this  redundant  information  is  costly  and  time  consuming.  Hence, 
it  is  desirable  to  reduce  the  image  to  its  basic  set  of  information.  Consider  a  gray 
scale  image  for  simplicity.  Each  pixel,  which  is  a  dot  of  a  given  intensity  level,  is 
modeled  as  a  random  variable.  For  the  house  image  example,  note  that  for  the 
siding pixels, the random variables are heavily correlated. For example, if $X_1$ and
$X_2$ denote neighboring pixels in the horizontal direction, then we would expect the
correlation coefficient $\rho_{X_1,X_2} = 1$. If this is the case, then we know from Section
7.9 that $X_1 = X_2$, assuming zero mean random variables in our model. There is no
economy in storing/transmitting the values $X_1 = x_1$ and $X_2 = x_2 = x_1$. We should
just store/transmit $X_1 = x_1$ and when it is necessary to reconstruct the image let
$X_2 = X_1 = x_1$. In this case, there is no image degradation in doing so. If, however,
$|\rho_{X_1,X_2}| < 1$, then there will be an error in the reconstructed $X_2$. If the correlation
coefficient is close to $\pm 1$, this error will be small. Even if it is not, for many images
the  errors  introduced  are  perceptually  unimportant.  Human  visual  perception  can 
tolerate  gross  errors  before  the  image  becomes  unsatisfactory. 

To  apply  this  idea  to  image  coding  we  will  consider  a  simple  yet  illustrative 
example.  The  amount  of  correlation  between  random  variables  is  quantified  by 
the  covariances.  In  particular,  for  multiple  random  variables  this  information  is 
embodied in the covariance matrix. For example, if $N = 3$ a covariance matrix of

$$\mathbf{C}_X = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 4 & 3.8 \\ 0 & 3.8 & 4 \end{bmatrix} \tag{9.47}$$

indicates that

$$\rho_{X_1,X_2} = \rho_{X_1,X_3} = 0$$

while $\rho_{X_2,X_3} = 3.8/\sqrt{4 \cdot 4} = 0.95$. Clearly, then $(X_1, X_2)$ or $(X_1, X_3)$ contain most
of the information. For more complicated covariance matrices these relationships
are not so obvious. For example, if

$$\mathbf{C}_X = \begin{bmatrix} 4 & 1 & 5 \\ 1 & 4 & 5 \\ 5 & 5 & 10 \end{bmatrix} \tag{9.48}$$

it is not obvious that $X_3 = X_1 + X_2$ (assuming zero mean random variables). (This
is verified by showing that $E[(X_3 - (X_1 + X_2))^2] = 0$ (see Problem 9.33).)
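
A quick numerical check of this claim is possible. For zero mean random variables, $E[(\mathbf{a}^T\mathbf{X})^2] = \mathbf{a}^T \mathbf{C}_X \mathbf{a}$ (the quadratic form that appears in Problem 9.14), so choosing $\mathbf{a} = [-1\ {-1}\ 1]^T$ gives the mean square value of $X_3 - (X_1 + X_2)$. The variable names below are illustrative, and the check does not replace the proof asked for in Problem 9.33.

```matlab
% Sketch: mean square value of X3 - (X1 + X2) for the covariance matrix (9.48)
CX = [4 1 5; 1 4 5; 5 5 10];     % covariance matrix of (9.48)
a = [-1; -1; 1];                 % a'*X = X3 - (X1 + X2)
msq = a'*CX*a                    % E[(X3 - (X1 + X2))^2] for zero mean X
```

A value of zero for `msq` means that $X_3 - (X_1 + X_2)$ has zero mean and zero variance, which is the sense in which $X_3 = X_1 + X_2$.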

The  technique  of  transform  coding  [Sayood  1996]  used  in  the  JPEG  encoding 
scheme  takes  advantage  of  the  correlation  between  random  variables.  The  particular 
version  we  describe  here  can  be  shown  to  be  an  optimal  approach  [Kramer  and 
Mathews  1956].  It  is  termed  the  Karhunen-Loeve  transform  and  an  approximate 
version  is  used  in  the  JPEG  encoding.  Transform  coding  operates  on  a  random 
vector  X  and  proceeds  as  follows: 

1. Transform the random variables into uncorrelated ones via a linear transformation
   $\mathbf{Y} = \mathbf{A}\mathbf{X}$, where $\mathbf{A}$ is an invertible $N \times N$ matrix.

2. Discard the random variables whose variance is small relative to the others by
   setting the corresponding elements of $\mathbf{Y}$ equal to zero. This yields a new $N \times 1$
   random vector $\hat{\mathbf{Y}}$. This vector would be stored or transmitted. (Of course,
   the zero vector elements would not require encoding, thereby effecting data
   compression. Their locations, though, would need to be specified.)

3. Transform back to $\hat{\mathbf{X}} = \mathbf{A}^{-1}\hat{\mathbf{Y}}$ to recover an approximation to the original
   random variables (if the values $\hat{\mathbf{Y}}$ were stored then this would occur upon retrieval
   or if they were transmitted, this would occur at the receiver).

By  decorrelating  the  random  variables  first  it  becomes  obvious  which  components 
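can be discarded without significantly affecting the reconstructed vector. To accomplish
the first step we have already determined that a suitable decorrelation matrix
is $\mathbf{V}^T$, where $\mathbf{V}$ is the matrix of eigenvectors of $\mathbf{C}_X$. Thus, we have that

$$
\begin{aligned}
\mathbf{C}_Y &= \mathbf{A}\mathbf{C}_X\mathbf{A}^T \\
&= \mathbf{V}^T\mathbf{C}_X\mathbf{V} \\
&= \boldsymbol{\Lambda} = \begin{bmatrix} \mathrm{var}(Y_1) & 0 & 0 \\ 0 & \mathrm{var}(Y_2) & 0 \\ 0 & 0 & \mathrm{var}(Y_3) \end{bmatrix}.
\end{aligned}
$$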

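We now carry out the transform coding procedure for the covariance matrix of
(9.48). This is done numerically using MATLAB. The statement [V Lambda]=eig(CX)
will produce the matrices $\mathbf{V}$ and $\boldsymbol{\Lambda}$, as

$$\mathbf{V} = \begin{bmatrix} 0.4082 & -0.7071 & 0.5774 \\ 0.4082 & 0.7071 & 0.5774 \\ 0.8165 & 0 & -0.5774 \end{bmatrix}$$

$$\boldsymbol{\Lambda} = \begin{bmatrix} 15 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$
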
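Hence, $\mathrm{var}(Y_3) = \lambda_3 = 0$ so that we discard it by setting $\hat{Y}_3 = 0$ and therefore

$$\hat{\mathbf{Y}} = \begin{bmatrix} Y_1 \\ Y_2 \\ 0 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}}_{\mathbf{B}} \underbrace{\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}}_{\mathbf{Y}}.$$

The reconstructed random vector becomes with $\mathbf{A} = \mathbf{V}^T$

$$
\begin{aligned}
\hat{\mathbf{X}} &= \mathbf{A}^{-1}\hat{\mathbf{Y}} = \mathbf{V}\hat{\mathbf{Y}} \\
&= \mathbf{V}\mathbf{B}\mathbf{Y} \\
&= \mathbf{V}\mathbf{B}\mathbf{V}^T\mathbf{X}
\end{aligned}
$$

and since

$$\mathbf{V}\mathbf{B}\mathbf{V}^T = \begin{bmatrix} \frac{2}{3} & -\frac{1}{3} & \frac{1}{3} \\ -\frac{1}{3} & \frac{2}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{2}{3} \end{bmatrix}$$

we have that

$$
\begin{aligned}
\hat{\mathbf{X}} &= \begin{bmatrix} \frac{2}{3}X_1 - \frac{1}{3}X_2 + \frac{1}{3}X_3 \\ -\frac{1}{3}X_1 + \frac{2}{3}X_2 + \frac{1}{3}X_3 \\ \frac{1}{3}X_1 + \frac{1}{3}X_2 + \frac{2}{3}X_3 \end{bmatrix} \\
&= \begin{bmatrix} X_1 \\ X_2 \\ X_1 + X_2 \end{bmatrix} \quad \text{(using } X_3 = X_1 + X_2 \text{, see Problem 9.33)} \\
&= \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = \mathbf{X}.
\end{aligned}
$$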

Here we see that the reconstructed vector $\hat{\mathbf{X}}$ is identical to the original one. Generally,
however, there will be an error. For the covariance matrix of (9.47) there
will be an error since $X_2$ and $X_3$ are not perfectly correlated. For that covariance
matrix the eigenvector and eigenvalue matrices are

$$\mathbf{V} = \begin{bmatrix} 0 & 1 & 0 \\ 0.7071 & 0 & 0.7071 \\ 0.7071 & 0 & -0.7071 \end{bmatrix}$$

$$\boldsymbol{\Lambda} = \begin{bmatrix} 7.8 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 0.2 \end{bmatrix}$$

and it is seen that the decorrelated random variables all have a nonzero variance
(recall that $\mathrm{var}(Y_i) = \lambda_i$). This indicates that no component of $\mathbf{Y}$ can be discarded
without causing an error upon reconstruction. By discarding $Y_3$, which has the
smallest variance, we will incur the least amount of error. Doing so produces the
reconstructed random vector

$$\hat{\mathbf{X}} = \mathbf{V}\mathbf{B}\mathbf{V}^T\mathbf{X} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix} \mathbf{X}$$

which becomes

$$\hat{\mathbf{X}} = \begin{bmatrix} X_1 \\ \frac{X_2 + X_3}{2} \\ \frac{X_2 + X_3}{2} \end{bmatrix}.$$


It is seen that the components $X_2$ and $X_3$ are each replaced by their average. This is due
to the nearly unity correlation coefficient ($\rho_{X_2,X_3} = 0.95$) between these
components. As an example, we generate 20 realizations of $\mathbf{X}$ as shown in Figure
9.4a, where the first realization is displayed in samples 1, 2, 3; the second realization
in samples 4, 5, 6, etc. The reconstructed realizations are shown in Figure 9.4b.


[Figure 9.4, panels (a) Original and (b) Reconstruction; horizontal axis: Sample.]

Figure 9.4: Realizations of original random vector $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{20}\}$ and reconstructed
random vectors $\{\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \ldots, \hat{\mathbf{x}}_{20}\}$. The displayed samples shown are components
of $\mathbf{x}_1$, followed by components of $\mathbf{x}_2$, etc.

Finally, the error between the two is shown in Figure 9.5.

Figure 9.5: Error between original random vector realizations and reconstructed
ones shown in Figure 9.4.

Note that the total average squared error or the total mean square error (MSE) is given by
$\sum_{i=1}^{3} E[(X_i - \hat{X}_i)^2]$, which is

$$
\begin{aligned}
\text{Total MSE} &= E[(X_1 - \hat{X}_1)^2 + (X_2 - \hat{X}_2)^2 + (X_3 - \hat{X}_3)^2] \\
&= E[(X_2 - (X_2 + X_3)/2)^2] + E[(X_3 - (X_2 + X_3)/2)^2] \\
&= E[((X_2 - X_3)/2)^2] + E[((X_3 - X_2)/2)^2] \\
&= \tfrac{1}{2} E[(X_2 - X_3)^2] \\
&= \tfrac{1}{2} [\mathrm{var}(X_2) + \mathrm{var}(X_3) - 2\,\mathrm{cov}(X_2, X_3)] \\
&= \tfrac{1}{2} [4 + 4 - 2(3.8)] = 0.2.
\end{aligned}
$$


This total MSE is estimated by taking the sum of the squares of the values in Figure
9.5 and dividing by 20, the number of vector realizations. Also, note what the total
MSE would have been if $\rho_{X_2,X_3} = 1$.
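
The three transform coding steps can be tried numerically on the two covariance matrices of this section. The sketch below assumes zero mean random variables, so that the error vector $(\mathbf{I} - \mathbf{P})\mathbf{X}$ with $\mathbf{P} = \mathbf{V}\mathbf{B}\mathbf{V}^T$ has covariance matrix $(\mathbf{I} - \mathbf{P})\mathbf{C}_X(\mathbf{I} - \mathbf{P})^T$, whose trace is the total MSE. The names `P`, `E`, and `mse`, and the rule of discarding only the single smallest-variance component, are illustrative choices and are not taken from the text's programs.

```matlab
% Sketch: transform coding applied to the covariance matrices (9.47) and (9.48)
C = {[4 0 0; 0 4 3.8; 0 3.8 4], [4 1 5; 1 4 5; 5 5 10]};
mse = zeros(1,2);
for c = 1:2
    CX = C{c};
    [V, Lambda] = eig(CX);                  % step 1: decorrelate with A = V'
    [lammin, idx] = min(diag(Lambda));      % component of Y with the smallest variance
    B = eye(3); B(idx,idx) = 0;             % step 2: discard that component
    P = V*B*V';                             % step 3: Xhat = V*B*V'*X
    E = eye(3) - P;                         % error is X - Xhat = E*X
    mse(c) = trace(E*CX*E');                % total MSE for zero mean X
end
mse
```

Up to rounding error, the first entry of `mse` should agree with the value 0.2 derived above for (9.47), and the second with the error-free reconstruction found for (9.48). In both cases the total MSE equals the discarded eigenvalue.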

Finally, to appreciate the error in terms of human vision perception, we can
convert the realizations of $\mathbf{X}$ and $\hat{\mathbf{X}}$ into an image. This is shown in Figure 9.6.
The grayscale bar shown at the right can be used to convert the various shades of
gray into numerical values. Also, note that as expected (see $\mathbf{C}_X$ in (9.47)) $X_1$ is
uncorrelated with $X_2$ and $X_3$, while $X_2$ and $X_3$ are heavily correlated in the upper
image. In the lower image $X_2$ and $X_3$ have been replaced by their average.


Figure  9.6:  Realizations  of  original  random  vector  and  reconstructed  random  vectors 
displayed  as  gray-scale  images.  The  upper  image  is  the  original  and  the  lower  image 
is  the  reconstructed  image. 
Problems

9.1  (o)  (w)  A  retired  person  gets  up  in  the  morning  and  decides  what  to  do  that 
day.  He  will  go  fishing  with  probability  0.3,  or  he  will  visit  his  daughter  with 
probability  0.2,  or  else  he  will  stay  home  and  tend  to  his  garden.  If  the  decision 
that  he  makes  each  day  is  independent  of  the  decisions  made  on  the  other  days,
what  is  the  probability  that  he  will  go  fishing  for  3  days,  visit  his  daughter  for 
2  days,  and  garden  for  2  days  of  the  week? 

9.2 (f,c) Compute the values of a multinomial PMF if $N = 3$, $M = 4$, $p_1 = 0.2$,
and $p_2 = 0.4$ for all possible $(k_1, k_2, k_3)$. Does the sum of the values equal one?
Hint: You will need a computer to do this.

9.3 (t) Prove the multinomial formula given by (9.5) for $N = 3$ by the following
method. Use the binomial formula to yield

$$(a_1 + b)^M = \sum_{k_1=0}^{M} \frac{M!}{k_1!(M - k_1)!} a_1^{k_1} b^{M-k_1}.$$

Then let $b = a_2 + a_3$ so that upon using the binomial formula again we have

$$(a_2 + a_3)^{M-k_1} = \sum_{k_2=0}^{M-k_1} \frac{(M - k_1)!}{k_2!(M - k_1 - k_2)!} a_2^{k_2} a_3^{M-k_1-k_2}.$$

Finally, rearrange the sums and note that $k_3 = M - k_1 - k_2$ so that there is
actually only a double sum in (9.5) for $N = 3$ due to this constraint.

9.4  (^)  (f)  Is  the  following  function  a  valid  PMF? 


i  /l\^2  0,1,... 

Pxux2,x3[h,k2,k3,]  =  -  ( -J  f-J  k2  =  0,1,... 

=  1, 0, 1. 

9.5 (w) For the joint PMF

$$p_{X_1,X_2,X_3}[k_1, k_2, k_3] = (1 - a)(1 - b)(1 - c)\, a^{k_1} b^{k_2} c^{k_3} \qquad \begin{array}{l} k_1 = 0, 1, \ldots \\ k_2 = 0, 1, \ldots \\ k_3 = 0, 1, \ldots \end{array}$$

where $0 < a < 1$, $0 < b < 1$, and $0 < c < 1$, find the marginal PMFs $p_{X_1}$, $p_{X_2}$,
and $p_{X_3}$.

9.6 (^) (w) For the joint PMF given below are there any subsets of the random
variables that are independent of each other?

$$p_{X_1,X_2,X_3}[k_1, k_2, k_3] = \binom{M}{k_1, k_2} p_1^{k_1} p_2^{k_2} (1 - p_3)\, p_3^{k_3} \qquad \begin{array}{l} k_1 = 0, 1, \ldots, M \\ k_2 = M - k_1 \\ k_3 = 0, 1, \ldots \end{array}$$

where $0 < p_1 < 1$, $p_2 = 1 - p_1$, and $0 < p_3 < 1$.
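9.7 (f) A random vector $\mathbf{X}$ with the joint PMF

$$p_{X_1,X_2,X_3}[k_1, k_2, k_3] = \exp[-(\lambda_1 + \lambda_2 + \lambda_3)] \frac{\lambda_1^{k_1} \lambda_2^{k_2} \lambda_3^{k_3}}{k_1!\, k_2!\, k_3!} \qquad \begin{array}{l} k_1 = 0, 1, \ldots \\ k_2 = 0, 1, \ldots \\ k_3 = 0, 1, \ldots \end{array}$$

is transformed according to $\mathbf{Y} = \mathbf{A}\mathbf{X}$ where

$$\mathbf{A} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}.$$

Find the joint PMF of $\mathbf{Y}$.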


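9.8 (t) Prove that

$$\int_{-\pi}^{\pi} \exp(j\omega k) \frac{d\omega}{2\pi} = \begin{cases} 0 & k \neq 0 \\ 1 & k = 0. \end{cases}$$

Hint: Expand $\exp(j\omega k)$ into its real and imaginary parts and note that
$\int (g(\omega) + jh(\omega))\,d\omega = \int g(\omega)\,d\omega + j \int h(\omega)\,d\omega$.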


9.9 (t) Prove that the sum of $N$ independent Poisson random variables with $X_i \sim
\mathrm{Pois}(\lambda_i)$ for $i = 1, 2, \ldots, N$ is again Poisson distributed but with parameter
$\lambda = \sum_{i=1}^{N} \lambda_i$. Hint: See Section 9.4.


9.10 (^) (w) The components of a random vector $\mathbf{X} = [X_1\ X_2\ \ldots\ X_N]^T$ all have
the same mean $E_X[X]$ and the same variance $\mathrm{var}(X)$. The “sample mean”
random variable

$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$$

is formed. If the $X_i$'s are independent, find the mean and variance of $\bar{X}$. What
happens to the variance as $N \to \infty$? Does this tell you anything about the
PMF of $\bar{X}$ as $N \to \infty$?


9.11 (w) Repeat Problem 9.10 if we know that each $X_i \sim \mathrm{Ber}(p)$. How can this
result  be  used  to  motivate  the  relative  frequency  interpretation  of  probability? 

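9.12 (f) If the covariance matrix of a $3 \times 1$ random vector $\mathbf{X}$ is

$$\mathbf{C}_X = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 2 & 2 \\ 1 & 2 & 4 \end{bmatrix}$$

find the correlation coefficients $\rho_{X_1,X_2}$, $\rho_{X_1,X_3}$, and $\rho_{X_2,X_3}$.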
9.13 (o) (w) A $2 \times 1$ random vector is given by

$$\mathbf{X} = \begin{bmatrix} U \\ 2U \end{bmatrix}$$

where $\mathrm{var}(U) = 1$. Find the covariance matrix for $\mathbf{X}$. Next find the correlation
coefficient $\rho_{X_1,X_2}$. Finally, compute the determinant of the covariance matrix.
Is the covariance matrix positive definite? Hint: A positive definite matrix
must have a positive determinant.


9.14 (t) Prove (9.26) by noting that

$$\mathbf{a}^T \mathbf{C}_X \mathbf{a} = \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j\, \mathrm{cov}(X_i, X_j).$$

9.15  (f)  For  the  covariance  matrix  given  in  Problem  9.12,  find  var(Xi  +  X2  +  X3). 

9.16 (t) Is it ever possible that $\mathrm{var}(X_1 + X_2) = \mathrm{var}(X_1)$ without $X_2$ being a constant?


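9.17 (^) (w) Which of the following matrices are not valid covariance matrices
and why?

a. $\begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$ \qquad b. $\begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}$ \qquad c. $\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$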

1 

1 


9.18 (f) A positive semidefinite matrix $\mathbf{A}$ must have $\det(\mathbf{A}) \geq 0$. Since a covariance
matrix must be positive semidefinite, use this property to prove that the
correlation coefficient satisfies $|\rho_{X_1,X_2}| \leq 1$. Hint: Consider a $2 \times 2$ covariance
matrix.


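9.19 (f) If a random vector $\mathbf{X}$ is transformed according to

$$
\begin{aligned}
Y_1 &= X_1 \\
Y_2 &= X_1 + X_2
\end{aligned}
$$

and the mean of $\mathbf{X}$ is

$$E_X[\mathbf{X}] = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$$

find the mean of $\mathbf{Y} = [Y_1\ Y_2]^T$.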

9.20 (^) (f) If the random vector $\mathbf{X}$ given in Problem 9.19 has a covariance matrix

$$\mathbf{C}_X = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$$

find the covariance matrix for $\mathbf{Y} = [Y_1\ Y_2]^T$.
9.21 (t) For $N = 2$ show that the covariance matrix may be defined as

$$\mathbf{C}_X = E_X\left[(\mathbf{X} - E_X[\mathbf{X}])(\mathbf{X} - E_X[\mathbf{X}])^T\right].$$

Hint:  Recall  that  the  expected  value  of  a  matrix  is  the  matrix  of  the  expected 
values  of  its  elements. 

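9.22 (t) In this problem you are asked to prove that if $\mathbf{Y} = \mathbf{A}\mathbf{X}$, where both $\mathbf{X}$ and
$\mathbf{Y}$ are $N \times 1$ random vectors and $\mathbf{A}$ is an $N \times N$ matrix, then $E_Y[\mathbf{Y}] = \mathbf{A} E_X[\mathbf{X}]$.
If we let $[\mathbf{A}]_{ij}$ be the $(i,j)$ element of $\mathbf{A}$, then you will need to prove that

$$[E_Y[\mathbf{Y}]]_i = \sum_{j=1}^{N} [\mathbf{A}]_{ij} [E_X[\mathbf{X}]]_j.$$

This is because if $\mathbf{b} = \mathbf{A}\mathbf{x}$, then $b_i = \sum_{j=1}^{N} a_{ij} x_j$, for $i = 1, 2, \ldots, N$ where $b_i$
is the $i$th element of $\mathbf{b}$ and $a_{ij}$ is the $(i,j)$ element of $\mathbf{A}$.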

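9.23 (t) In this problem we prove that

$$E_X[\mathbf{A}\mathbf{G}(\mathbf{X})\mathbf{A}^T] = \mathbf{A} E_X[\mathbf{G}(\mathbf{X})] \mathbf{A}^T$$

where $\mathbf{A}$ is an $N \times N$ matrix and $\mathbf{G}(\mathbf{X})$ is an $N \times N$ matrix whose elements
are all functions of $\mathbf{X}$. To do so we note that if $\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D}$ are all $N \times N$
matrices then $\mathbf{D} = \mathbf{A}\mathbf{B}\mathbf{C}$ is an $N \times N$ matrix with $(i,l)$ element

$$
\begin{aligned}
[\mathbf{D}]_{il} &= \sum_{k=1}^{N} [\mathbf{A}\mathbf{B}]_{ik} [\mathbf{C}]_{kl} \\
&= \sum_{k=1}^{N} \left( \sum_{j=1}^{N} [\mathbf{A}]_{ij} [\mathbf{B}]_{jk} \right) [\mathbf{C}]_{kl} \\
&= \sum_{k=1}^{N} \sum_{j=1}^{N} [\mathbf{A}]_{ij} [\mathbf{B}]_{jk} [\mathbf{C}]_{kl}.
\end{aligned}
$$

Using this result and replacing $\mathbf{A}$ by itself, $\mathbf{B}$ by $\mathbf{G}(\mathbf{X})$, and $\mathbf{C}$ by $\mathbf{A}^T$ will
allow the desired result to be proven.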

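9.24 (f) Prove (9.29) and (9.30) for the case of $N = 2$ by letting

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$$

$$\mathbf{b}_1 = \begin{bmatrix} b_1^{(1)} \\ b_2^{(1)} \end{bmatrix} \qquad \mathbf{b}_2 = \begin{bmatrix} b_1^{(2)} \\ b_2^{(2)} \end{bmatrix}$$

$$\mathbf{d}_1 = \begin{bmatrix} d_1^{(1)} \\ d_2^{(1)} \end{bmatrix} \qquad \mathbf{d}_2 = \begin{bmatrix} d_1^{(2)} \\ d_2^{(2)} \end{bmatrix}$$

and multiplying out all the matrices and vectors. Then, verify that the relationships
are true by showing that the elements of the resultant $N \times N$
matrices are identical.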


9.25 (c) Using MATLAB, find the eigenvectors and corresponding eigenvalues for
the covariance matrix

$$\mathbf{C}_X = \begin{bmatrix} 26 & 6 \\ 6 & 26 \end{bmatrix}.$$

To do so use the statement [V Lambda]=eig(CX).

9.26 (0) (f,c) Find a linear transformation to decorrelate the random vector $\mathbf{X} =
[X_1\ X_2]^T$ that has the covariance matrix

$$\mathbf{C}_X = \begin{bmatrix} 10 & 6 \\ 6 & 20 \end{bmatrix}.$$

What are the variances of the decorrelated random variables?

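9.27 (t) Prove that an orthogonal matrix, i.e., one that has the property $\mathbf{U}^T =
\mathbf{U}^{-1}$, rotates a vector $\mathbf{x}$ to a new vector $\mathbf{y}$. Do this by letting $\mathbf{y} = \mathbf{U}\mathbf{x}$ and
showing that the length of $\mathbf{y}$ is the same as the length of $\mathbf{x}$. The length of a
vector is defined to be $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}} = \sqrt{x_1^2 + x_2^2 + \cdots + x_N^2}$.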

9.28 (t) Prove that if the random variables $X_1, X_2, \ldots, X_N$ are independent, then
the joint characteristic function factors as

$$\phi_{X_1,X_2,\ldots,X_N}(\omega_1, \omega_2, \ldots, \omega_N) = \phi_{X_1}(\omega_1)\, \phi_{X_2}(\omega_2) \cdots \phi_{X_N}(\omega_N).$$

Alternatively,  if  the  joint  characteristic  function  factors,  what  does  this  say 
about  the  random  variables  and  why? 

9.29 (f) For the random walk described in Example 9.5 find the mean and the
variance of $X_n$ as a function of $n$ if $p = 3/4$. What do they indicate about the
probable outcomes of $X_1, X_2, \ldots, X_N$?

9.30 (c) For the random walk of Problem 9.29 simulate several realizations of the
random vector $\mathbf{X} = [X_1\ X_2\ \ldots\ X_N]^T$ and plot these as $x_n$ versus $n$ for $n =
1, 2, \ldots, N = 50$. Does the appearance of the outcomes corroborate your
results in Problem 9.29? Also, compare your results to those shown in Figure
9.3b.

9.31 (t) Prove the relationship given by (9.45) as follows. Consider the $(i,j)$ element
of $\mathbf{C}_X$, which is $\mathrm{cov}(X_i, X_j) = E_{X_i,X_j}[X_i X_j] - E_{X_i}[X_i] E_{X_j}[X_j]$. Then,
show that the latter is just the $(i,j)$ element of the right-hand side of (9.45).
Recall the definition of the expected value of a matrix/vector as the
matrix/vector of expected values.
9.32 (c) A random vector is defined as $\mathbf{X} = [X_1\ X_2\ \ldots\ X_N]^T$, where each component
is $X_i \sim \mathrm{Ber}(1/2)$ and all the random variables are independent. Since
the random variables are independent, the covariance matrix should be diagonal.
Using MATLAB, generate realizations of $\mathbf{X}$ for $N = 10$ by using
x=floor(rand(10,1)+0.5) to generate a single vector realization. Next generate
multiple random vector realizations and use them to estimate the covariance
matrix. Presumably the random numbers that MATLAB produces are
“pseudo-independent” and hence “pseudo-uncorrelated”. Does this appear to
be the case? Hint: Use the MATLAB command mesh(CXest) to plot the
estimated covariance matrix CXest.


9.33 (w) Prove that if $X_1, X_2, X_3$ are zero mean random variables, then $E[(X_3 -
(X_1 + X_2))^2] = 0$ for the covariance matrix given by (9.48).

9.34 (t) In this problem we explain how to generate a computer realization of a
random vector with a given covariance matrix. This procedure was used to
produce the realizations shown in Figure 9.4a. For simplicity the desired $N \times 1$
random vector $\mathbf{X}$ is assumed to have a zero mean vector. The procedure is
to first generate an $N \times 1$ random vector $\mathbf{U}$ whose elements are zero mean,
uncorrelated random variables with unit variances so that its covariance matrix
is $\mathbf{I}$. Then transform $\mathbf{U}$ according to $\mathbf{X} = \mathbf{B}\mathbf{U}$, where $\mathbf{B}$ is an appropriate
$N \times N$ matrix. The matrix $\mathbf{B}$ is obtained from the $N \times N$ matrix $\sqrt{\boldsymbol{\Lambda}}$, whose
elements are obtained from the eigenvalue matrix $\boldsymbol{\Lambda}$ of $\mathbf{C}_X$ by taking the
square root of the elements of $\boldsymbol{\Lambda}$, and $\mathbf{V}$, where $\mathbf{V}$ is the eigenvector matrix of
$\mathbf{C}_X$, to form $\mathbf{B} = \mathbf{V}\sqrt{\boldsymbol{\Lambda}}$. Prove that the covariance matrix of $\mathbf{B}\mathbf{U}$ will be $\mathbf{C}_X$.
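
The procedure of Problem 9.34 can be sketched in a few MATLAB lines. The example below uses the covariance matrix of (9.47) and draws the elements of $\mathbf{U}$ as independent $\pm 1$ values with probability 1/2 each, the PMF given in Problem 9.36. Both choices, and the variable names, are assumptions made for illustration; the numerical check that $\mathbf{B}\mathbf{B}^T$ reproduces $\mathbf{C}_X$ is not a substitute for the proof the problem asks for.

```matlab
% Sketch: realizations of a zero mean random vector with a given covariance matrix
CX = [4 0 0; 0 4 3.8; 0 3.8 4];     % desired covariance matrix, here (9.47)
[V, Lambda] = eig(CX);              % eigenvector and eigenvalue matrices of CX
B = V*sqrt(Lambda);                 % B = V*sqrt(Lambda); Lambda is diagonal
M = 20;                             % number of vector realizations
U = 2*(rand(3,M) > 0.5) - 1;        % zero mean, unit variance, independent elements
X = B*U;                            % each column is one realization of X
err = max(max(abs(B*B' - CX)))      % B*B' should equal CX up to rounding error
```

Since the covariance matrix of $\mathbf{U}$ is $\mathbf{I}$, the covariance matrix of $\mathbf{B}\mathbf{U}$ is $\mathbf{B}\mathbf{B}^T$, which is why the last line compares $\mathbf{B}\mathbf{B}^T$ with $\mathbf{C}_X$.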

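9.35 (o) (f) Using the results of Problem 9.34 find a matrix transformation $\mathbf{B}$ of
$\mathbf{U} = [U_1\ U_2]^T$, where $\mathbf{C}_U = \mathbf{I}$, so that $\mathbf{X} = \mathbf{B}\mathbf{U}$ has the covariance matrix

$$\mathbf{C}_X = \begin{bmatrix} 4 & 1 \\ 1 & 4 \end{bmatrix}.$$
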

9.36 (o) (c) Generate 30 realizations of a $2 \times 1$ random vector $\mathbf{X}$ that has a zero
mean vector and the covariance matrix given in Problem 9.35. To do so use
the results from Problem 9.35. For the random vector $\mathbf{U}$ assume that $U_1$ and
$U_2$ are uncorrelated and have the same PMF

$$p_U[k] = \begin{cases} \frac{1}{2} & k = -1 \\ \frac{1}{2} & k = 1. \end{cases}$$

Note that the mean of $\mathbf{U}$ is zero and the covariance matrix of $\mathbf{U}$ is $\mathbf{I}$. Next
estimate the covariance matrix $\mathbf{C}_X$ using your realizations and compare it to
the true covariance matrix.
Continuous  Random  Variables


10.1  Introduction 

In  Chapters  5-9  we  discussed  discrete  random  variables  and  the  methods  employed 
to  describe  them  probabilistically.  The  principal  assumption  necessary  in  order  to 
do  so  is  that  the  sample  space,  which  is  the  set  of  all  possible  outcomes,  is  finite  or 
at  most  countably  infinite.  It  followed  then  that  a  probability  mass  function  (PMF) 
could  be  defined  as  the  probability  of  each  sample  point  and  used  to  calculate  the 
probability  of  all  possible  events  (which  are  subsets  of  the  sample  space).  Most 
physical  measurements,  however,  do  not  produce  a  discrete  set  of  values  but  rather 
a  continuum  of  values  such  as  the  rainfall  measurement  data  previously  shown  in 
Figures  1.1  and  1.2.  Another  example  is  the  maximum  temperature  measured  during 
the  day,  which  might  be  anywhere  between  20°F  and  60°F.  The  number  of  possible 
temperatures  in  the  interval  [20, 60]  is  infinite  and  uncountable.  Therefore,  we  cannot 
assign  a  valid  PMF  to  the  temperature  random  variable.  Of  course,  we  could  always 
choose  to  “round  off”  the  measurement  to  the  nearest  degree  so  that  the  possible 
outcomes  would  then  become  {20, 21, . . . ,  60}.  Then,  many  valid  PMFs  could  be 
assigned.  But  this  approach  compromises  the  measurement  precision  and  so  is  to 
be avoided if possible. What we are ultimately interested in is the probability of
any interval, such as the probability of the temperature being in the interval [20, 25]
or [55, 60] or the union of intervals $[20, 25] \cup [55, 60]$. To do so we must extend our
previous approaches to be able to handle this new case. And if we later decide that
less precision is warranted, such that the rounding of 20.6° to 21° is acceptable, we
will still be able to determine the probability of observing 21°. To do so we can
regard the rounded temperature of 21° as having arisen from all temperatures in
the interval $A = [20.5, 21.5)$. Then, $P[\text{rounded temperature} = 21] = P[A]$, so that
we have lost nothing by considering a continuum of outcomes (see Problem 10.2).

Chapters  10-14  discuss  continuous  random  variables  in  a  manner  similar  to 
Chapters  5-9  for  discrete  random  variables.  Since  many  of  the  concepts  are  the 
same, we will not belabor the discussion but will concentrate our efforts on the
algebraic manipulations required to analyze continuous random variables. It may be
of interest to note that discrete and continuous random variables can be subsumed
under the topic of a general random variable. There exists the mathematical
machinery to analyze both types of random variables simultaneously. This theory is
called measure theory [Capinski, Kopp 2004]. It requires an advanced mathematical
background and does not easily lend itself to intuitive interpretations. An alternative
means of describing the general random variable that appeals more to engineers
and scientists makes use of the Dirac delta function. This approach is discussed
later in this chapter under the topic of mixed random variables.

In  the  course  of  our  discussions  we  will  revisit  some  of  the  concepts  alluded  to  in 
Chapters  1  and  2.  With  the  appropriate  mathematical  tools  we  will  now  be  able  to 
define  these  concepts.  Hence,  the  reader  may  wish  to  review  the  relevant  sections 
in  those  chapters. 


10.2  Summary 

The definition of a continuous random variable is given in Section 10.3 and illustrated
in Figure 10.1. The probabilistic description of a continuous random variable
is the probability density function (PDF) $p_X(x)$ with its interpretation as the probability
per unit length. As such the probability of an interval is given by the area
under the PDF (10.4). The properties of a PDF are that it is nonnegative and
integrates to one, as summarized by Properties 10.1 and 10.2 in Section 10.4. Some
important PDFs are given in Section 10.5, such as the uniform (10.6), the exponential
(10.5), the Gaussian or normal (10.7), the Laplacian (10.8), the Cauchy (10.9),
the Gamma (10.10), and the Rayleigh (10.14). Special cases of the Gamma PDF
are the exponential, the chi-squared (10.12), and the Erlang (10.13). The cumulative
distribution function (CDF) for a continuous random variable is defined the
same as for the discrete random variable and is given by (10.16). The corresponding
CDFs for the PDFs of Section 10.5 are given in Section 10.6. In particular, the
CDF for the standard normal is denoted by $\Phi(x)$ and is related to the Q function
by (10.17). The latter function cannot be evaluated in closed form but may be
found numerically using the MATLAB subprogram Q.m listed in Appendix 10B. An
approximation  to  the  Q  function  is  given  by  (10.23).  The  CDF  is  useful  in  that 
probabilities  of  intervals  are  easily  found  via  (10.25)  once  the  CDF  is  known.  The 
transformation  of  a  continuous  random  variable  by  a  one-to-one  function  produces 
the  PDF  of  (10.30).  If  the  transformation  is  many-to-one,  then  (10.33)  can  be  used 
to  determine  the  PDF  of  the  transformed  random  variable.  Mixed  random  variables, 
ones  that  exhibit  nonzero  probabilities  for  some  points  but  are  continuous  otherwise, 
are  described  in  Section  10.8.  They  can  be  described  by  a  PDF  if  we  allow  the  use 
of  the  Dirac  delta  function  or  impulse.  For  a  general  mixed  random  variable  the 
PDF  is  given  by  (10.36).  To  generate  realizations  of  a  continuous  random  variable 
on  a  digital  computer  one  can  use  a  transformation  of  a  uniform  random  variable 


10.3.  DEFINITION  OF  A  CONTINUOUS  RANDOM  VARIABLE 


287 


as  summarized  in  Theorem  10.9.1.  Examples  are  given  in  Section  10.9.  Estimation 
of  the  PDF  and  CDF  can  be  accomplished  by  using  (10.38)  and  (10.39).  Finally,  an 
example  of  the  application  of  the  theory  to  the  problem  of  speech  clipping  is  given 
in  Section  10.10. 

10.3  Definition  of  a  Continuous  Random  Variable 

A continuous random variable $X$ is defined as a mapping from the experimental
sample space $S$ to a numerical (or measurement) sample space $S_X$, which is a subset
of the real line $R^1$. In contrast to the sample space of a discrete random variable,
$S_X$ consists of an infinite and uncountable number of outcomes. As an example,
consider an experiment in which a dart is thrown at the circular dartboard shown in
Figure 10.1. The outcome of the dart-throwing experiment is a point $s_1$ in the circle
of radius one. The distance from the bullseye (center of the dartboard) is measured
and that value is assigned to the random variable as $X(s_1) = x_1$.

Figure 10.1: Mapping of the outcome of a thrown dart to the real line (example of
continuous random variable). Here $S_X = [0, 1]$.

Clearly then, the possible outcomes of the random variable are in the interval
$[0, 1]$, which is an uncountably infinite set. We cannot assign a nonzero probability
to each value of $X$ and expect the sum of the probabilities to be one. One way out of
this dilemma is to assign probabilities to intervals, as was done in Section 3.6.
There we had a one-dimensional dartboard and we assigned a probability of the dart
landing in an interval to be the length of the interval. Similarly, for our problem
if each value of $X$ is equally likely so that intervals of the same length are equally
likely, we could assign

$$P[a \le X \le b] = b - a \qquad 0 \le a \le b \le 1 \qquad (10.1)$$

for the probability of the dart landing in the interval $[a, b]$. This probability
assignment satisfies the probability axioms given in Section 3.6 and so would suffice
to calculate the probability of any interval or union of disjoint intervals (use
Axiom 3 for disjoint intervals). But what would we do if the probability of all equal
length intervals were not the same? For example, a champion dart thrower would be
more likely to obtain a value near $x = 0$ than near $x = 1$. We therefore need a more
general approach. For discrete random variables it was just as easy to assign PMFs
that were not uniform as ones that were uniform. Our goal then is to extend this
approach to encompass continuous random variables. We will do so by examining
the approximation afforded by using the PMF to calculate interval probabilities for
continuous random variables.

Consider first a possible approximation of (10.1) by a uniform PMF as

$$p_X[x_i] = \frac{1}{M}, \qquad x_i = i\,\Delta x \quad \text{for } i = 1, 2, \ldots, M$$

where $\Delta x = 1/M$, so that $M\,\Delta x = 1$ as shown in Figure 10.2. Then to approximate
the probability of the outcome of $X$ in the interval $[a, b]$ we can use

$$P[a \le X \le b] = \sum_{\{i:\, a \le x_i \le b\}} \frac{1}{M}. \qquad (10.2)$$

Figure 10.2: Approximating the probability of an interval for a continuous random
variable by using a PMF. (a) $M = 10$, $\Delta x = 0.1$. (b) $M = 20$, $\Delta x = 0.05$.

For example, referring to Figure 10.2a, if $a = 0.38$ and $b = 0.52$, then there are two
values of $x_i$ that lie in that interval and therefore $P[0.38 \le X \le 0.52] = 2/M = 0.2$,
even though we know that the true value from (10.1) is 0.14. To improve the
quality of our approximation we increase $M$ to $M = 20$ as shown in Figure 10.2b.
Then, we have three values of $x_i$ that lie in the interval and therefore
$P[0.38 \le X \le 0.52] = 3/M = 0.15$, which is closer to the true value. Clearly, if we
let $M \to \infty$ or equivalently let $\Delta x \to 0$, our approximation will become exact.
Considering again (10.2) with $\Delta x = 1/M$, we have

$$P[a \le X \le b] = \sum_{\{i:\, a \le x_i \le b\}} 1 \cdot \Delta x$$

and defining $p_X(x) = 1$ for $0 < x < 1$ and zero otherwise, we can write this as

$$P[a \le X \le b] = \sum_{\{i:\, a \le x_i \le b\}} p_X(x_i)\,\Delta x. \qquad (10.3)$$


Finally, letting $\Delta x \to 0$ to yield no error in the approximation, the sum in (10.3)
becomes an integral and $p_X(x_i) \to p_X(x)$ so that

$$P[a \le X \le b] = \int_a^b p_X(x)\,dx \qquad (10.4)$$

which gives the same result for the probability of an interval as (10.1). Note that
$p_X(x)$ is defined to be 1 for all $0 < x < 1$. To interpret this new function $p_X(x)$ we
have from (10.3) with $x_0 = k\,\Delta x$ for $k$ an integer

$$\begin{aligned}
P[x_0 - \Delta x/2 \le X \le x_0 + \Delta x/2]
  &= \sum_{\{i:\, x_0 - \Delta x/2 \le x_i \le x_0 + \Delta x/2\}} p_X(x_i)\,\Delta x \\
  &= \sum_{\{i:\, x_i = x_0\}} p_X(x_i)\,\Delta x \qquad (\text{only one value of } x_i \text{ within interval}) \\
  &= p_X(x_0)\,\Delta x
\end{aligned}$$

which yields

$$p_X(x_0) = \frac{P[x_0 - \Delta x/2 \le X \le x_0 + \Delta x/2]}{\Delta x}.$$


This is the probability of $X$ being in the interval $[x_0 - \Delta x/2,\ x_0 + \Delta x/2]$
divided by the interval length $\Delta x$. Hence, $p_X(x_0)$ is the probability per unit
length and is termed the probability density function (PDF). It can be used to find
the probability of any interval by using (10.4). Equivalently, since the value of an
integral may be interpreted as the area under a curve, the probability is found by
determining the area under the PDF curve. This is shown in Figure 10.3. The PDF is
denoted by $p_X(x)$, where we now use parentheses since the argument is no longer
discrete but continuous. Also, for the same reason we omit the subscript $i$, which was
used for the PMF argument. Hence, the PDF for a continuous random variable is the
extension of the PMF that we sought. Before continuing we examine this example
further.
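The limiting argument leading to (10.4) can be checked numerically. The following MATLAB sketch evaluates the sum (10.3) for the uniform case with $a = 0.38$ and $b = 0.52$ as $M$ increases. The variable names and the choice $M = 1000$ are illustrative assumptions, and a small tolerance is added to the comparisons only to guard against floating-point rounding of the grid points $x_i = i\,\Delta x$.

```matlab
% Approximate P[a <= X <= b] for the uniform case by the sum (10.3).
% Names (a, b, Mvals, Papprox, tol) are illustrative, not from the text.
a = 0.38; b = 0.52;
Mvals = [10 20 1000];
Papprox = zeros(size(Mvals));
tol = 1e-12;                      % guards against rounding in i*dx
for k = 1:length(Mvals)
    M = Mvals(k);
    dx = 1/M;
    xi = (1:M)*dx;                % grid points x_i = i*dx
    inside = (xi >= a - tol) & (xi <= b + tol);
    Papprox(k) = sum(inside)*dx;  % sum of p_X(x_i)*dx with p_X = 1
end
disp([Mvals' Papprox'])           % compare with the exact value b - a
```

For $M = 10$ and $M = 20$ the sketch reproduces the values 0.2 and 0.15 quoted above, and larger $M$ moves the sum toward the exact value $b - a = 0.14$.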




CHAPTER  10.  CONTINUOUS  RANDOM  VARIABLES 


(a)  Probability  density  function  (b)  Probability  shown  as  shaded  area 


Figure  10.3:  Example  of  probability  density  function  and  how  probability  is  found 
as  the  area  under  it. 


Example  10.1  -  PDF  for  a  uniform  random  variable  and  the  MATLAB 
command  rand 


The PDF given by

$$p_X(x) = \begin{cases} 1 & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}$$

is known as a uniform PDF. Equivalently, $X$ is said to be a uniform random variable
or we say that $X$ is uniformly distributed on $(0, 1)$. The shorthand notation is
$X \sim U(0, 1)$. Observe that this is the continuous random variable for which MATLAB
uses rand to produce a realization. Hence, in simulating a coin toss with a
probability of heads of $p = 0.75$, we use (10.4) to obtain

$$P[a \le X \le b] = \int_a^b 1\,dx = b - a$$

and choose $a = 0$ and $b = 0.75$. The probability of obtaining an outcome in the
interval $(0, 0.75]$ for a random variable $X \sim U(0, 1)$ is now seen to be 0.75. Hence,
the code below can be used to generate the outcomes of a repeated coin tossing
experiment with $p = 0.75$.

% M, the number of tosses, must be set before running this loop
for i=1:M
    u=rand(1,1);
    if u<=0.75
        x(i,1)=1;  % head mapped into 1
    else
        x(i,1)=0;  % tail mapped into 0
    end
end


Could we have used any other values for $a$ and $b$?

◊
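As a complement to the loop in Example 10.1, the following sketch generates all tosses at once and compares the relative frequency of heads with $p = 0.75$. The value of M and the second interval $(0.25, 1]$ are illustrative assumptions; by (10.4), any interval of length 0.75 inside $(0, 1)$ has the same probability.

```matlab
% Simulate M coin tosses with P[head] = 0.75 from U(0,1) outcomes.
% M and the second interval (0.25, 1] are illustrative choices.
M = 100000;
u = rand(M,1);                 % M realizations of a U(0,1) random variable
x1 = double(u <= 0.75);        % head (1) if u falls in (0, 0.75]
x2 = double(u > 0.25);         % head (1) if u falls in (0.25, 1]
phat1 = mean(x1);              % relative frequency of heads, first mapping
phat2 = mean(x2);              % relative frequency of heads, second mapping
disp([phat1 phat2])            % both should be close to 0.75
```

The two mappings are applied to the same outcomes u, so they produce different toss sequences but relative frequencies that agree to within the usual sampling variability.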

Now returning to our dart thrower, we can acknowledge her superior dart-throwing
ability by assigning a nonuniform PDF as shown in Figure 10.4. The probability of
throwing a dart within a circle of radius 0.1 or $X \in [0, 0.1]$ will be larger than for
the region between the circles with radii 0.9 and 1 or $X \in [0.9, 1]$. Specifically,
using (10.4)

$$P[0 \le X \le 0.1] = \int_0^{0.1} 2(1 - x)\,dx = 2(x - x^2/2)\Big|_0^{0.1} = 0.19$$

$$P[0.9 \le X \le 1] = \int_{0.9}^{1} 2(1 - x)\,dx = 2(x - x^2/2)\Big|_{0.9}^{1} = 0.01.$$

Figure 10.4: Nonuniform PDF, $p_X(x) = 2(1 - x)$.

Note that in this example $p_X(x) \ge 0$ for all $x$ and also
$\int_{-\infty}^{\infty} p_X(x)\,dx = 1$. These are properties that must be satisfied for a
valid PDF. We will say more about these properties in the next section.
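The two interval probabilities and the total area can be verified numerically. The sketch below assumes the PDF $p_X(x) = 2(1 - x)$ on $[0, 1]$ from Figure 10.4 and uses trapezoidal integration; the grid sizes and names are illustrative.

```matlab
% Numerical check of the dart-thrower PDF p_X(x) = 2(1 - x) on [0,1].
% Grid sizes and variable names are illustrative choices.
pX = @(x) 2*(1 - x).*(x >= 0 & x <= 1);
x = linspace(0, 1, 100001);
total = trapz(x, pX(x));               % area under the whole PDF
xin = linspace(0, 0.1, 1001);
Pin = trapz(xin, pX(xin));             % P[0 <= X <= 0.1]
xout = linspace(0.9, 1, 1001);
Pout = trapz(xout, pX(xout));          % P[0.9 <= X <= 1]
fprintf('area = %.4f, Pin = %.4f, Pout = %.4f\n', total, Pin, Pout);
```

Because the PDF is linear on each integration range, the trapezoidal rule is exact here up to rounding, so the results should match 1, 0.19, and 0.01.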

It may be helpful to consider a mass analogy to the PDF. An example is shown
in Figure 10.5. It can be thought of as a slice of Jarlsberg cheese with length 2
meters, height of 1 meter, and depth of 1 meter, which might be purchased for a
New Year’s Eve party (with a lot of guests!). If its mass is 1 kilogram (it is a new
“lite” cheese), then its overall density $D$ is

$$D = \frac{\text{mass}}{\text{volume}} = \frac{M}{V} = \frac{1\ \text{kg}}{1\ \text{m}^3} = 1\ \text{kg/m}^3.$$

Figure 10.5: Jarlsberg cheese slice used for mass analogy to PDF.


However, its linear density or mass per meter, which is defined as $\Delta M/\Delta x$, will
change with $x$. If each guest is allowed to cut a wedge of cheese of length $\Delta x$ as
shown in Figure 10.5, then clearly the hungriest guests should choose a wedge near
$x = 2$ for the greatest amount of cheese. To determine the linear density we compute
$\Delta M/\Delta x$ versus $x$. To do so first note that $\Delta M = D\,\Delta V = \Delta V$ and
$\Delta V = 1 \cdot (\text{area of face})$, where the face is seen to be trapezoidal. Thus,

$$\Delta V = \frac{1}{2}\Delta x \left( \frac{x_0 - \Delta x/2}{2} + \frac{x_0 + \Delta x/2}{2} \right) = \frac{1}{2} x_0\,\Delta x.$$

Hence, $\Delta M/\Delta x = \Delta V/\Delta x = x_0/2$ and this is the same even as $\Delta x \to 0$. Thus,

$$\frac{dM}{dx} = \frac{1}{2}x \qquad 0 \le x \le 2$$

and to obtain the mass for any wedge from $x = a$ to $x = b$ we need only integrate
$dM/dx$ to obtain the mass as a function of $x$. This yields

$$M([a, b]) = \int_a^b \frac{1}{2}x\,dx = \int_a^b m(x)\,dx$$

where $m(x) = x/2$ is the linear mass density or the mass per unit length. It is
perfectly analogous to the PDF which is the probability per unit length. Can you
find the total mass of cheese from $M([a, b])$? See also Problem 10.3.

10.4  The PDF and Its Properties

The PDF must have certain properties so that the probabilities obtained using (10.4)
satisfy the axioms given in Section 3.4. Since the probability of an interval is given
by

$$P[a \le X \le b] = \int_a^b p_X(x)\,dx$$

the PDF must have the following properties.

Property 10.1 - PDF must be nonnegative.

$$p_X(x) \ge 0 \qquad -\infty < x < \infty.$$

Proof: If $p_X(x) < 0$ on some small interval $[x_0 - \Delta x/2,\ x_0 + \Delta x/2]$, then

$$P[x_0 - \Delta x/2 \le X \le x_0 + \Delta x/2] = \int_{x_0 - \Delta x/2}^{x_0 + \Delta x/2} p_X(x)\,dx < 0$$

which violates Axiom 1 that $P[E] \ge 0$ for all events $E$.

□


Property 10.2 - PDF must integrate to one.

$$\int_{-\infty}^{\infty} p_X(x)\,dx = 1$$

Proof:

$$1 = P[-\infty < X < \infty] = \int_{-\infty}^{\infty} p_X(x)\,dx$$

□

Hence,  any  nonnegative  function  that  integrates  to  one  can  be  considered  as  a  PDF. 
An  example  follows. 

Example 10.2 - Exponential PDF

Consider the function

$$p_X(x) = \begin{cases} \lambda \exp(-\lambda x) & x \ge 0 \\ 0 & x < 0 \end{cases} \qquad (10.5)$$

for $\lambda > 0$. This is called the exponential PDF and is shown in Figure 10.6. Note
that it is discontinuous at $x = 0$. Hence, a PDF need not be continuous (see also
Figure 10.3a for the uniform PDF which also has points of discontinuity). Also, for
$\lambda > 1$, we have $p_X(0) = \lambda > 1$. In contrast to a PMF, the PDF can exceed one in
value. It is the area under the PDF that cannot exceed one.

Figure 10.6: Exponential PDF.

As expected $p_X(x) \ge 0$ for $-\infty < x < \infty$ and

$$\int_{-\infty}^{\infty} p_X(x)\,dx = \int_0^{\infty} \lambda \exp(-\lambda x)\,dx = -\exp(-\lambda x)\Big|_0^{\infty} = 1$$

for $\lambda > 0$. This PDF is often used as a model for the lifetime of a product. For
example, if $X$ is the failure time in days of a lightbulb, then $P[X > 100]$ is the
probability that the lightbulb will fail after 100 days or it will last for at least 100
days. This is found to be

$$\begin{aligned}
P[X > 100] &= \int_{100}^{\infty} \lambda \exp(-\lambda x)\,dx \\
           &= -\exp(-\lambda x)\Big|_{100}^{\infty} \\
           &= \exp(-100\lambda) = \begin{cases} 0.368 & \lambda = 0.01 \\ 0.905 & \lambda = 0.001. \end{cases}
\end{aligned}$$
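The calculations of Example 10.2 can be reproduced with a few lines of MATLAB. This is a sketch under stated assumptions: the finite upper limit of 3000 days and the grid spacing stand in for the infinite integration range, and the variable names are illustrative.

```matlab
% Exponential PDF checks; the lambda values follow Example 10.2.
lambda = [0.01 0.001];
Psurvive = exp(-100*lambda);        % closed form for P[X > 100]
% Area under the PDF for lambda = 0.01, truncated at x = 3000 (assumed
% large enough that the neglected tail exp(-30) is negligible)
x = linspace(0, 3000, 300001);
area = trapz(x, lambda(1)*exp(-lambda(1)*x));
disp(Psurvive)                      % survival probabilities for both lambdas
disp(area)                          % should be very close to one
```

A smaller $\lambda$ corresponds to a longer-lived product, which is why the survival probability is larger for $\lambda = 0.001$ than for $\lambda = 0.01$.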


The  probability  of  a  sample  point  is  zero. 


If $X$ is a continuous random variable, then it was argued in Section 3.6 that the
probability of a point is zero. This is consistent with our definition of a PDF. If the
width of the interval shrinks to zero, then the area under the PDF also goes to zero.
Hence, $P[X = x] = 0$. This is true whether or not $p_X(x)$ is continuous at the point
of interest (as long as the discontinuity is a finite jump). In the previous example
of an exponential PDF $P[X = 0] = 0$ even though $p_X(x)$ is discontinuous at $x = 0$.
This means that we could, if desired, have defined the exponential PDF as

$$p_X(x) = \begin{cases} \lambda \exp(-\lambda x) & x > 0 \\ 0 & x \le 0 \end{cases}$$

for which $p_X(0)$ is now defined to be 0. It makes no difference in our probability
calculations whether we include $x = 0$ in the interval or not. Hence, we see that

$$\int_{0^-}^{b} p_X(x)\,dx = \int_{0^+}^{b} p_X(x)\,dx = \int_{0}^{b} p_X(x)\,dx$$

and in a similar manner if $X$ is a continuous random variable, then

$$P[a \le X \le b] = P[a < X \le b] = P[a \le X < b] = P[a < X < b].$$


In  summary,  the  value  assigned  to  the  PDF  at  a  discontinuity  is  arbitrary  since 
it  does  not  affect  any  subsequent  probability  calculation  involving  a  continuous 
random  variable.  However,  for  discontinuities  other  than  step  discontinuities  (which 
are  jumps  of  finite  magnitude)  we  will  see  in  Section  10.8  that  we  must  be  more 
careful. 


10.5  Important  PDFs 

There  are  a  multitude  of  PDFs  in  use  in  various  scientific  disciplines.  The  books 
by  [Johnson,  Kotz,  and  Balakrishnan  1994]  contain  a  summary  of  many  of  these 
and  should  be  consulted  for  further  information.  We  now  describe  some  of  the  more 
important  PDFs. 


10.5.1  Uniform 


We have already encountered a special case of the uniform PDF in Figure 10.3.
More generally it is defined as

$$p_X(x) = \begin{cases} \dfrac{1}{b - a} & a < x < b \\ 0 & \text{otherwise} \end{cases} \qquad (10.6)$$

and examples are shown in Figure 10.7. It is given the shorthand notation
$X \sim U(a, b)$. If $a = 0$ and $b = 1$, then an outcome of a $U(0, 1)$ random variable can be
generated in MATLAB using rand.

Figure 10.7: Examples of uniform PDF. (a) $a = 1$, $b = 3$. (b) $a = 1$, $b = 6$.


10.5.2  Exponential 

This was previously defined in Example 10.2. The shorthand notation is
$X \sim \exp(\lambda)$.

10.5.3  Gaussian  or  Normal 

This is the famous “bell-shaped” curve first introduced in Section 1.3. It is given by

$$p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2}(x - \mu)^2 \right] \qquad -\infty < x < \infty \qquad (10.7)$$

where $\sigma^2 > 0$ and $-\infty < \mu < \infty$. Its application in practical problems is ubiquitous.
It is shown to integrate to one in Problem 10.9. Some examples of this PDF as well
as some outcomes for various values of the parameters $(\mu, \sigma^2)$ are shown in Figures
10.8 and 10.9. It is characterized by the two parameters $\mu$ and $\sigma^2$. The parameter
$\mu$ indicates the center of the PDF which is seen in Figures 10.8a and 10.8c. It depicts
the “average value” of the random variable as can be observed by examining Figures
10.8b and 10.8d. In Chapter 11 we will show that $\mu$ is actually the mean of $X$. The
parameter $\sigma^2$ indicates the width of the PDF as is seen in Figures 10.9a and 10.9c.
It is related to the variability of the outcomes as seen in Figures 10.9b and 10.9d. In
Chapter 11 we will show that $\sigma^2$ is actually the variance of $X$. The PDF is called the
Gaussian PDF after the famous German mathematician K.F. Gauss and also the
normal PDF, since “normal” populations tend to exhibit this type of distribution.
A standard normal PDF is one for which $\mu = 0$ and $\sigma^2 = 1$. The shorthand notation
is $X \sim \mathcal{N}(\mu, \sigma^2)$. MATLAB generates a realization of a standard normal random
variable using randn. This was used extensively in Chapter 2.

Figure 10.8: Examples of Gaussian PDF with different $\mu$'s. (a) and (b) $\mu = 0$,
$\sigma^2 = 1$. (c) and (d) $\mu = 2$, $\sigma^2 = 1$.


To find the probability of the outcome of a Gaussian random variable lying within
an interval requires numerical integration (see Problem 1.14) since the integral

$$\int_a^b \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2}x^2 \right) dx$$

cannot be evaluated analytically. A MATLAB subprogram will be provided and
described shortly to do this. The Gaussian PDF is commonly used to model noise in
a communication system (see Section 2.6), as well as for numerous other applications.
We will see in Chapter 15 that the PDF arises quite naturally as the PDF of a large
number of independent random variables that have been added together.

Figure 10.9: Examples of Gaussian PDF with different $\sigma^2$'s. (a) and (b) $\mu = 0$,
$\sigma^2 = 1$. (c) and (d) $\mu = 0$, $\sigma^2 = 2$.
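As an illustration of the numerical integration mentioned above, the following sketch approximates the probability that a standard normal random variable lies in $[a, b]$ by the trapezoidal rule and compares it with the same quantity expressed through MATLAB's built-in erf. The interval, the grid size, and the variable names are assumptions made for the example; this is not the subprogram the text refers to.

```matlab
% P[a <= X <= b] for X ~ N(0,1) by numerical integration of the PDF.
% The interval [a, b] and the grid size are illustrative choices.
a = -1; b = 1;
x = linspace(a, b, 20001);
pX = exp(-0.5*x.^2)/sqrt(2*pi);                 % standard normal PDF
Pnum = trapz(x, pX);                            % trapezoidal approximation
Perf = 0.5*(erf(b/sqrt(2)) - erf(a/sqrt(2)));   % same probability via erf
fprintf('numerical = %.6f, erf-based = %.6f\n', Pnum, Perf);
```

The two values should agree to many decimal places, which gives some confidence in either route when no closed form is available.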


10.5.4  Laplacian 

This PDF is named after Laplace, the famous French mathematician. It is similar
to the Gaussian except that it does not decrease as rapidly from its maximum value.
Its PDF is

$$p_X(x) = \frac{1}{\sqrt{2\sigma^2}} \exp\left( -\sqrt{\frac{2}{\sigma^2}}\,|x| \right) \qquad -\infty < x < \infty \qquad (10.8)$$

where $\sigma^2 > 0$. Again the parameter $\sigma^2$ specifies the width of the PDF, and will be
shown in Chapter 11 to be the variance of $X$. It is seen to be symmetric about $x = 0$.
Some examples of the PDF and outcomes are shown in Figure 10.10. Note that for
the same $\sigma^2$ as the Gaussian PDF, the outcomes are larger as seen by comparing
Figure 10.10b to Figure 10.9b. This is due to the larger probability in the “tails” of
the PDF. The “tail” region of the PDF is that for which $|x|$ is large. The Laplacian
PDF is easily integrated to find the probability of an interval. This PDF is used as
a model for speech amplitudes [Rabiner and Schafer 1978].

Figure 10.10: Examples of Laplacian PDF with different $\sigma^2$'s. (c) and (d) $\sigma^2 = 4$.

10.5.5  Cauchy 

The Cauchy PDF is named after another famous French mathematician and is
defined as

$$p_X(x) = \frac{1}{\pi(1 + x^2)} \qquad -\infty < x < \infty. \qquad (10.9)$$

It is shown in Figure 10.11 and is seen to be symmetric about $x = 0$. The Cauchy
PDF can easily be integrated to find the probability of any interval. It arises as the
PDF of the ratio of two independent $\mathcal{N}(0, 1)$ random variables (see Chapter 12).


Figure  10.11:  Cauchy  PDF. 


10.5.6  Gamma 

The Gamma PDF is a very general PDF that is used for nonnegative random
variables. It is given by

$$p_X(x) = \begin{cases} \dfrac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} \exp(-\lambda x) & x \ge 0 \\ 0 & x < 0 \end{cases} \qquad (10.10)$$

where $\lambda > 0$, $\alpha > 0$, and $\Gamma(z)$ is the Gamma function which is defined as

$$\Gamma(z) = \int_0^{\infty} t^{z - 1} \exp(-t)\,dt. \qquad (10.11)$$

Clearly, the $\Gamma(\alpha)$ factor in (10.10) is the normalizing constant needed to ensure that
the PDF integrates to one. Some examples of this PDF are shown in Figure 10.12.
The shorthand notation is $X \sim \Gamma(\alpha, \lambda)$. Some useful properties of the Gamma
function are as follows.

Property 10.3 - $\Gamma(z + 1) = z\,\Gamma(z)$

Proof: See Problem 10.16.

□

Figure 10.12: Examples of Gamma PDF. (a) $\lambda = 1$. (b) $\alpha = 2$.


Property 10.4 - $\Gamma(N) = (N - 1)!$

Proof: Follows from Property 10.3 with $z = N - 1$ since

$$\begin{aligned}
\Gamma(N) &= (N - 1)\,\Gamma(N - 1) \\
          &= (N - 1)(N - 2)\,\Gamma(N - 2) \qquad (\text{let } z = N - 2 \text{ now}) \\
          &= (N - 1)(N - 2) \cdots 1 = (N - 1)!
\end{aligned}$$

□


Property 10.5 - $\Gamma(1/2) = \sqrt{\pi}$

Proof:

$$\Gamma(1/2) = \int_0^{\infty} t^{-1/2} \exp(-t)\,dt$$

(Note that near $t = 0$ the integrand becomes infinite but $t^{-1/2} \exp(-t) \approx t^{-1/2}$
which is integrable.) Now let $t = u^2/2$ and thus $dt = u\,du$ which yields

$$\begin{aligned}
\Gamma(1/2) &= \int_0^{\infty} \frac{\sqrt{2}}{u} \exp(-u^2/2)\, u\,du \\
            &= \int_0^{\infty} \sqrt{2} \exp(-u^2/2)\,du \\
            &= \frac{\sqrt{2}}{2} \int_{-\infty}^{\infty} \exp(-u^2/2)\,du \qquad (\text{integrand is symmetric about } u = 0) \\
            &= \sqrt{\pi} \qquad (\text{why?})
\end{aligned}$$

□

The Gamma PDF reduces to many well known PDFs for appropriate choices of
the parameters $\alpha$ and $\lambda$. Some of these are:

1. Exponential for $\alpha = 1$
   From (10.10) we have

   $$p_X(x) = \begin{cases} \dfrac{\lambda}{\Gamma(1)} \exp(-\lambda x) & x \ge 0 \\ 0 & x < 0. \end{cases}$$

   But $\Gamma(1) = 0! = 1$, which results from Property 10.4 so that we have the
   exponential PDF.

2. Chi-squared PDF with $N$ degrees of freedom for $\alpha = N/2$ and $\lambda = 1/2$
   From (10.10) we have

   $$p_X(x) = \begin{cases} \dfrac{1}{2^{N/2}\,\Gamma(N/2)}\, x^{N/2 - 1} \exp(-x/2) & x \ge 0 \\ 0 & x < 0. \end{cases} \qquad (10.12)$$

   This is called the chi-squared PDF with $N$ degrees of freedom and is important
   in statistics. It can be shown to be the PDF for the sum of the squares of $N$
   independent random variables all with the same PDF $\mathcal{N}(0, 1)$ (see Problem
   12.44). The shorthand notation is $X \sim \chi^2_N$.

3. Erlang for $\alpha = N$
   From (10.10) we have

   $$p_X(x) = \begin{cases} \dfrac{\lambda^{N}}{\Gamma(N)}\, x^{N - 1} \exp(-\lambda x) & x \ge 0 \\ 0 & x < 0 \end{cases}$$

   and since $\Gamma(N) = (N - 1)!$ from Property 10.4, this becomes

   $$p_X(x) = \begin{cases} \dfrac{\lambda^{N}}{(N - 1)!}\, x^{N - 1} \exp(-\lambda x) & x \ge 0 \\ 0 & x < 0. \end{cases} \qquad (10.13)$$

   This PDF arises as the PDF of a sum of $N$ independent exponential random
   variables all with the same $\lambda$ (see also Problem 10.17).
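The Gamma function properties and the special cases listed above lend themselves to a quick numerical spot check. The sketch below uses MATLAB's built-in gamma and factorial functions; the helper name gammapdf_sketch, the test values, and the truncation of the integration range at $x = 60$ are assumptions made for illustration.

```matlab
% Spot checks for the Gamma function and the Gamma PDF (10.10).
% gammapdf_sketch and all test values are illustrative assumptions.
z = 2.7;  N = 6;
err3 = abs(gamma(z + 1) - z*gamma(z));       % Property 10.3
err4 = abs(gamma(N) - factorial(N - 1));     % Property 10.4
err5 = abs(gamma(0.5) - sqrt(pi));           % Property 10.5

% Gamma PDF evaluated only for x >= 0
gammapdf_sketch = @(x, alpha, lambda) ...
    (lambda^alpha/gamma(alpha)) * x.^(alpha - 1) .* exp(-lambda*x);
x = linspace(0, 60, 600001);                 % finite grid standing in for [0, inf)
lambda = 0.5;
% alpha = 1 should reproduce the exponential PDF (10.5)
errExp = max(abs(gammapdf_sketch(x, 1, lambda) - lambda*exp(-lambda*x)));
% Erlang (alpha = 3) and chi-squared (alpha = 4/2, lambda = 1/2) areas
areaErlang = trapz(x, gammapdf_sketch(x, 3, lambda));
areaChi2 = trapz(x, gammapdf_sketch(x, 4/2, 1/2));
disp([err3 err4 err5 errExp])                % all should be near zero
disp([areaErlang areaChi2])                  % both should be near one
```

The areas fall slightly short of one only because the grid is truncated at a finite upper limit.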


10.5.7  Rayleigh 

The Rayleigh PDF is named after the famous British physicist Lord Rayleigh and
is defined as

$$p_X(x) = \begin{cases} \dfrac{x}{\sigma^2} \exp\left( -\dfrac{x^2}{2\sigma^2} \right) & x \ge 0 \\ 0 & x < 0. \end{cases} \qquad (10.14)$$

It is shown in Figure 10.13. The Rayleigh PDF is easily integrated to yield the
probability of any interval. It can be shown to arise as the PDF of the square root
of the sum of the squares of two independent $\mathcal{N}(0, 1)$ random variables (see Example
12.12).

Figure 10.13: Rayleigh PDF with $\sigma^2 = 1$.

Finally, note that many of these PDFs arise as the PDFs of transformed Gaussian
random variables. Therefore, realizations of the random variable may be obtained
by first generating multiple realizations of independent standard normal or $\mathcal{N}(0, 1)$
random variables, and then performing the appropriate transformation. An
alternative and more general approach to generating realizations of a random
variable, once the PDF is known, is via the probability integral transformation to be
discussed in Section 10.9.
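The idea of generating realizations by transforming Gaussian outcomes can be sketched as follows. The code draws pairs of independent $\mathcal{N}(0, 1)$ outcomes with randn and forms Rayleigh ($\sigma^2 = 1$), chi-squared ($N = 2$), and Cauchy outcomes as described in Section 10.5. The number of realizations and the comparison points are illustrative assumptions; the reference probabilities follow from integrating (10.14), (10.12), and (10.9).

```matlab
% Realizations of Rayleigh, chi-squared and Cauchy random variables
% obtained by transforming independent N(0,1) outcomes (illustrative sketch).
M = 100000;
w1 = randn(M,1);  w2 = randn(M,1);
r = sqrt(w1.^2 + w2.^2);        % Rayleigh with sigma^2 = 1
c = w1.^2 + w2.^2;              % chi-squared with N = 2 degrees of freedom
y = w1./w2;                     % Cauchy
% Relative frequencies versus probabilities obtained from the PDFs
PrEst = mean(r <= 1);        PrTrue = 1 - exp(-0.5);   % Rayleigh, P[X <= 1]
PcEst = mean(c <= 2);        PcTrue = 1 - exp(-1);     % chi-squared, P[X <= 2]
PyEst = mean(abs(y) <= 1);   PyTrue = 0.5;             % Cauchy, P[-1 <= X <= 1]
disp([PrEst PrTrue; PcEst PcTrue; PyEst PyTrue])
```

Each row of the displayed matrix pairs an estimate with its theoretical value; the agreement improves as M grows.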

10.6  Cumulative  Distribution  Functions 

The cumulative distribution function (CDF) for a continuous random variable is
defined exactly the same as for a discrete random variable. It is

$$F_X(x) = P[X \le x] \qquad -\infty < x < \infty \qquad (10.15)$$

and is evaluated using the PDF as

$$F_X(x) = \int_{-\infty}^{x} p_X(t)\,dt \qquad -\infty < x < \infty. \qquad (10.16)$$

Avoiding  confusion  in  evaluating  CDFs 


It is important to note that in evaluating a definite integral such as in (10.16) it
is best to replace the variable of integration with another symbol. This is because
the upper limit depends on $x$ which would conflict with the dummy variable of
integration. We have chosen to use $t$ but of course any other symbol that does not
conflict with $x$ can be used.


Some examples of the evaluation of the CDF are given next.


10.6.1  Uniform 

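Using (10.6) we have

$$F_X(x) = \begin{cases} 0 & x < a \\ \displaystyle\int_a^x \frac{1}{b - a}\,dt & a \le x \le b \\ 1 & x > b \end{cases}$$

which is

$$F_X(x) = \begin{cases} 0 & x < a \\ \dfrac{1}{b - a}(x - a) & a \le x \le b \\ 1 & x > b. \end{cases}$$

An example is shown in Figure 10.14 for $a = 1$ and $b = 2$.

Figure 10.14: CDF for uniform random variable over interval (1,2).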


10.6.2  Exponential 

Using (10.5) we have

$$F_X(x) = \begin{cases} 0 & x < 0 \\ \displaystyle\int_0^x \lambda \exp(-\lambda t)\,dt & x \ge 0. \end{cases}$$

But

$$\int_0^x \lambda \exp(-\lambda t)\,dt = -\exp(-\lambda t)\Big|_0^x = 1 - \exp(-\lambda x)$$

so that

$$F_X(x) = \begin{cases} 0 & x < 0 \\ 1 - \exp(-\lambda x) & x \ge 0. \end{cases}$$

An example is shown in Figure 10.15 for $\lambda = 1$.

Figure 10.15: CDF for exponential random variable with $\lambda = 1$.

Note that for the uniform and exponential random variables the CDFs are
continuous even though the PDFs are discontinuous. This property motivates an
alternative definition of a continuous random variable as one for which the CDF is
continuous. Recall that the CDF of a discrete random variable is always
discontinuous, displaying multiple jumps.
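The relationship (10.16) between PDF and CDF can be illustrated numerically for the exponential case. The sketch below accumulates the area under the PDF with cumtrapz and compares it with the closed-form CDF for $\lambda = 1$; it also forms an interval probability as a difference of CDF values. The grid and the interval endpoints are illustrative assumptions.

```matlab
% CDF of an exponential random variable: running integral of the PDF,
% as in (10.16), compared with the closed form. lambda = 1 as in Figure 10.15.
lambda = 1;
t = linspace(0, 10, 100001);
pX = lambda*exp(-lambda*t);
Fnum = cumtrapz(t, pX);                 % numerical integral from 0 to each t
Fcf = 1 - exp(-lambda*t);               % closed-form CDF for t >= 0
maxErr = max(abs(Fnum - Fcf));
% Interval probability as a difference of CDF values (endpoints assumed)
a = 0.5; b = 2;
Pab = (1 - exp(-lambda*b)) - (1 - exp(-lambda*a));
fprintf('max CDF error = %.2e, P[a < X <= b] = %.4f\n', maxErr, Pab);
```

The numerically accumulated CDF is continuous and nondecreasing, consistent with the observation that the CDF is continuous even though the PDF jumps at $x = 0$.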


10.6.3  Gaussian 

Consider a standard normal PDF, which is a Gaussian PDF with $\mu = 0$ and $\sigma^2 = 1$.
(If $\mu \ne 0$ and/or $\sigma^2 \ne 1$ the CDF is a simple modification as shown in Problem
10.22.) Then from (10.7) we have

$$F_X(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2}t^2 \right) dt \qquad -\infty < x < \infty.$$

This cannot be evaluated further but can be found numerically and is shown in
Figure 10.16. The CDF for a standard normal is usually given the special symbol
$\Phi(x)$ so that

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2}t^2 \right) dt \qquad -\infty < x < \infty.$$

Figure 10.16: CDF for standard normal or Gaussian random variable.

Hence, $\Phi(x)$ represents the area under the PDF to the left of the point $x$ as seen
in Figure 10.17a. It is sometimes more convenient, however, to have knowledge of
the area to the right instead. This is called the right-tail probability of a standard
normal and is given the symbol $Q(x)$. It is termed the “Q” function and is defined
as the area to the right of $x$, an example of which is shown in Figure 10.17b.

Figure 10.17: Definitions of $\Phi(x)$ and $Q(x)$ functions. (a) Shaded area = $\Phi(1)$.
(b) Shaded area = $Q(1)$.

By its definition we have

$$\begin{aligned}
Q(x) &= 1 - \Phi(x) && (10.17) \\
     &= \int_{x}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2}t^2 \right) dt \qquad -\infty < x < \infty && (10.18)
\end{aligned}$$

and is shown in Figure 10.18, plotted on a linear as well as a logarithmic vertical
scale.

Figure 10.18: $Q(x)$ function. (a) Linear vertical scale. (b) Logarithmic vertical
scale, for display of small values of $Q(x)$.

Some of the properties of the Q function that are easily verified are (see
Problem 10.25)

$$Q(-\infty) = 1 \qquad (10.19)$$
$$Q(\infty) = 0 \qquad (10.20)$$
$$Q(0) = \frac{1}{2} \qquad (10.21)$$
$$Q(-x) = 1 - Q(x). \qquad (10.22)$$

Although the Q function cannot be evaluated analytically, it is related to the well
known “error function”. Thus, making use of the latter function a MATLAB
subprogram Q.m, which is listed in Appendix 10B, can be used to evaluate it. An
example follows.

Example  10.3  —  Probability  of  error  in  a  communication  system 

In Section 2.6 we analyzed the probability of error for a PSK digital communication
system. The probability of error $P_e$ was given by

$$P_e = P[A/2 + W \le 0]$$

where $W \sim \mathcal{N}(0, 1)$. (In the MATLAB code we used w=randn(1,1) and hence the
random variable representing the noise was a standard normal random variable.)
To explicitly evaluate $P_e$ we have that

$$\begin{aligned}
P_e &= P[A/2 + W \le 0] \\
    &= 1 - P[A/2 + W > 0] \\
    &= 1 - P[W > -A/2] \\
    &= 1 - Q(-A/2) \qquad (\text{definition}) \\
    &= Q(A/2) \qquad (\text{use (10.22)}).
\end{aligned}$$

Hence, the true $P_e$ shown in Figure 2.15 as the dashed line can be found by using
the MATLAB subprogram Q.m, which is listed in Appendix 10B, for the argument
$A/2$ (see Problem 10.26). It is also sometimes important to determine $A$ to yield
a given $P_e$. This is found as $A = 2Q^{-1}(P_e)$, where $Q^{-1}$ is the inverse of the Q
function. It is defined as the value of $x$ necessary to yield a given value of $Q(x)$.
It too cannot be expressed analytically but may be evaluated using the MATLAB
subprogram Qinv.m, also listed in Appendix 10B.
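One way the Q function and its inverse might be evaluated in MATLAB is through the built-in complementary error function, using the standard relations $Q(x) = \frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$ and $Q^{-1}(p) = \sqrt{2}\,\mathrm{erfcinv}(2p)$. The sketch below is not the Appendix 10B listing, which is not reproduced here; the names Qfun and Qinvfun and the amplitude $A = 4$ are hypothetical choices for illustration.

```matlab
% Q function and its inverse via the complementary error function.
% Qfun and Qinvfun are hypothetical names; they are not the Appendix 10B
% listings Q.m and Qinv.m.
Qfun = @(x) 0.5*erfc(x/sqrt(2));          % right-tail probability of N(0,1)
Qinvfun = @(p) sqrt(2)*erfcinv(2*p);      % value x for which Q(x) = p
A = 4;                                    % illustrative signal amplitude
Pe = Qfun(A/2);                           % probability of error Q(A/2)
Arec = 2*Qinvfun(Pe);                     % amplitude recovered from Pe
fprintf('Pe = %.5f, recovered A = %.4f\n', Pe, Arec);
```

Recovering the original amplitude from $P_e$ is a simple round-trip check that the two helper functions are mutually consistent.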


The Q function can also be approximated for large values of $x$ using [Abramowitz and Stegun 1965]

$$ Q(x) \approx \frac{1}{\sqrt{2\pi}\,x} \exp\left(-\frac{1}{2}x^2\right) \quad x > 3. \tag{10.23} $$
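To get a feel for how good a large-$x$ approximation of this kind is, the Python sketch below (the text itself uses MATLAB) compares the true $Q(x)$, computed from the complementary error function, with the standard form $\exp(-x^2/2)/(\sqrt{2\pi}\,x)$. That this form is the one intended by (10.23) is an assumption, as are the function names.

```python
import math

def q_exact(x):
    """Q(x) via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q_approx(x):
    """Assumed large-x approximation: exp(-x^2/2) / (sqrt(2*pi) * x)."""
    return math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x)

def relative_error(x):
    """Relative error of the approximation with respect to the true value."""
    return (q_approx(x) - q_exact(x)) / q_exact(x)
```

Evaluating `relative_error` at increasing arguments shows the approximation overestimating $Q(x)$ by a margin that shrinks as $x$ grows, which is why the restriction to large $x$ matters.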


A comparison of the approximation to the true value is shown in Figure 10.19.

[Figure 10.19: Approximation of Q function - true value is shown dashed.]

If $X \sim \mathcal{N}(\mu, \sigma^2)$, then the right-tail probability becomes

$$ P[X > x] = Q\left(\frac{x - \mu}{\sqrt{\sigma^2}}\right) \tag{10.24} $$




(see Problem 10.24). Finally, note that the area under the standard normal Gaussian PDF is mostly contained in the interval $[-3, 3]$. As seen in Figure 10.19, $Q(3) \approx 0.001$, which means that the area to the right of $x = 3$ is only about 0.001. Since the PDF is symmetric, the total area to the right of $x = 3$ and to the left of $x = -3$ is about 0.002, or the area in the $[-3, 3]$ interval is about 0.998. Hence, about 99.8% of the probability lies within this interval. We would not expect to see a value greater than 3 in magnitude very often. This is borne out by an examination of Figure 10.8b. How many realizations would you expect to see in the interval $(1, \infty)$? Is this consistent with Figure 10.8b?

As we have seen, the CDF for a continuous random variable has certain properties. For the most part they are the same as for a discrete random variable: the CDF is 0 at $x = -\infty$, 1 at $x = \infty$, and is monotonically increasing (or stays the same) between these limits. However, now it is continuous, having no jumps. The most important property for practical purposes is that which allows us to compute probabilities of intervals. This follows from the property

$$ P[a \le X \le b] = P[a < X \le b] = F_X(b) - F_X(a) \tag{10.25} $$

which is easily proven (see Problem 10.35). It can be seen to be valid by referring to Figure 10.20. Using the CDF we no longer have to integrate the PDF to determine probabilities of intervals. In effect, all the integration has been done for us in finding the CDF. Some examples follow.

[Figure 10.20: Illustration of use of CDF to find probability of interval.]

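Example 10.4 - Probability of interval for exponential PDF

Since $F_X(x) = 1 - \exp(-\lambda x)$ for $x \ge 0$, we have for $a > 0$ and $b > 0$

$$ \begin{aligned} P[a \le X \le b] &= F_X(b) - F_X(a) \\ &= (1 - \exp(-\lambda b)) - (1 - \exp(-\lambda a)) \\ &= \exp(-\lambda a) - \exp(-\lambda b) \end{aligned} $$

which should be compared to

$$ \int_a^b \lambda \exp(-\lambda x)\,dx. $$

❖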
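The comparison suggested in Example 10.4 can be carried out numerically. The Python sketch below (the text itself uses MATLAB) computes the interval probability once from the CDF and once by a midpoint-rule integration of the exponential PDF. The function names, the midpoint rule, and the number of subintervals are illustrative assumptions.

```python
import math

def exp_cdf(x, lam):
    """CDF of an exponential random variable with parameter lam."""
    return 1.0 - math.exp(-lam * x) if x >= 0.0 else 0.0

def interval_prob_cdf(a, b, lam):
    """P[a <= X <= b] from the CDF difference, as in (10.25)."""
    return exp_cdf(b, lam) - exp_cdf(a, lam)

def interval_prob_integral(a, b, lam, n=10000):
    """The same probability by midpoint-rule integration of the PDF."""
    h = (b - a) / n
    total = sum(lam * math.exp(-lam * (a + (k + 0.5) * h)) for k in range(n))
    return total * h
```

Both routes agree to many decimal places, but the CDF route needs only two function evaluations, which is the point of the remark that the integration has already been done in finding the CDF.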
Since we obtained the CDF from the PDF, we might suppose that the PDF could be recovered from the CDF. For a discrete random variable this was the case since $p_X[x_i] = F_X(x_i^+) - F_X(x_i^-)$. For a continuous random variable we consider a small interval $[x_0 - \Delta x/2, x_0 + \Delta x/2]$ and evaluate its probability using (10.25) with

$$ F_X(x) = \int_{-\infty}^{x} p_X(t)\,dt. $$

Then, we have

$$ \begin{aligned} F_X(x_0 + \Delta x/2) - F_X(x_0 - \Delta x/2) &= \int_{-\infty}^{x_0 + \Delta x/2} p_X(t)\,dt - \int_{-\infty}^{x_0 - \Delta x/2} p_X(t)\,dt \\ &= \int_{x_0 - \Delta x/2}^{x_0 + \Delta x/2} p_X(t)\,dt \\ &\to p_X(x_0) \int_{x_0 - \Delta x/2}^{x_0 + \Delta x/2} 1\,dt \quad (p_X(t) \approx \text{constant as } \Delta x \to 0) \\ &= p_X(x_0)\,\Delta x \end{aligned} $$

so that

$$ p_X(x_0) \approx \frac{F_X(x_0 + \Delta x/2) - F_X(x_0 - \Delta x/2)}{\Delta x} \to \left. \frac{dF_X(x)}{dx} \right|_{x = x_0} \quad \text{as } \Delta x \to 0. $$

Hence, we can obtain the PDF from the CDF by differentiation or

$$ p_X(x) = \frac{dF_X(x)}{dx}. \tag{10.26} $$


This relationship is really just the fundamental theorem of calculus [Widder 1989]. Note the similarity to the discrete case in which $p_X[x_i] = F_X(x_i^+) - F_X(x_i^-)$. As an example, if $X \sim \exp(\lambda)$, then

$$ F_X(x) = \begin{cases} 1 - \exp(-\lambda x) & x \ge 0 \\ 0 & x < 0. \end{cases} $$

For all $x$ except $x = 0$ (at which the CDF does not have a derivative due to the change in slope as seen in Figure 10.15) we have

$$ p_X(x) = \frac{dF_X(x)}{dx} = \begin{cases} 0 & x < 0 \\ \lambda \exp(-\lambda x) & x > 0 \end{cases} $$

and as remarked earlier, $p_X(0)$ can be assigned any value.
10.7 Transformations

In discussing transformations for discrete random variables we noted that a transformation can be either one-to-one or many-to-one. For example, the function $g(x) = 2x$ is one-to-one while $g(x) = x^2$ is many-to-one (in this case two-to-one since $-x$ and $+x$ both map into $x^2$). The determination of the PDF of $Y = g(X)$ will depend upon which type of transformation we have. Initially, we will consider the one-to-one case, which is simpler. For the transformation of a discrete random variable we saw from (5.9) that the PMF of $Y = g(X)$ for any $g$ could be found from the PMF of $X$ using

$$ p_Y[y_i] = \sum_{\{j:\, g(x_j) = y_i\}} p_X[x_j]. $$

But if $g$ is one-to-one we have only a single solution for $g(x_j) = y_i$ so that $x_j = g^{-1}(y_i)$ and therefore

$$ p_Y[y_i] = p_X[g^{-1}(y_i)] \tag{10.27} $$

and we are done. For example, assume $X$ takes on values $\{1, 2\}$ with a PMF $p_X[1]$ and $p_X[2]$ and we wish to determine the PMF of $Y = g(X) = 2X$, which is shown in Figure 10.21. Then from (10.27)

$$ \begin{aligned} p_Y[2] &= p_X[g^{-1}(2)] = p_X[1] \\ p_Y[4] &= p_X[g^{-1}(4)] = p_X[2]. \end{aligned} $$

[Figure 10.21: Transformation of a discrete random variable, $y = g(x) = 2x$.]

Because we are now dealing with a PDF, which is a density function, and not a PMF, which is a probability function, the simple relationship of (10.27) is no longer valid. To see what happens instead, consider the problem of determining the PDF of $Y = 2X$, where $X \sim \mathcal{U}(1, 2)$. Clearly, $S_X = \{x : 1 < x < 2\}$ and therefore $S_Y = \{y : 2 < y < 4\}$ so that $p_Y(y)$ must be zero outside the interval $(2, 4)$. The results of a MATLAB computer simulation are shown in Figure 10.22. A total of 50 realizations were obtained for $X$ and $Y$. The generated $X$ outcomes are shown on the x-axis and the resultant $Y$ outcomes obtained from $y = 2x$ are shown on the y-axis. Also, a 50% expanded version of the points is shown to the right.

[Figure 10.22: Computer generated realizations of $X$ and $Y = 2X$ for $X \sim \mathcal{U}(1, 2)$. A 50% expanded version of the realizations is shown to the right.]

It is seen that the density of points on the y-axis is less than that on the x-axis. After some thought the reader will realize that this is the result of the scaling by a factor of 2 due to the transformation. Since the PDF is probability per unit length, we should expect $p_Y = p_X/2$ for $2 < y < 4$. To prove that this is so, we note that a small interval on the x-axis, say $[x_0 - \Delta x/2, x_0 + \Delta x/2]$, will map into $[2x_0 - \Delta x, 2x_0 + \Delta x]$ on the y-axis. However, the intervals are equivalent events and so their probabilities must be equal. It follows then that

$$ \int_{x_0 - \Delta x/2}^{x_0 + \Delta x/2} p_X(x)\,dx = \int_{2x_0 - \Delta x}^{2x_0 + \Delta x} p_Y(y)\,dy $$

and as $\Delta x \to 0$, we have that $p_X(x) \to p_X(x_0)$ and $p_Y(y) \to p_Y(2x_0)$ in the small intervals so that

$$ p_X(x_0)\,\Delta x = p_Y(2x_0)\,2\Delta x $$

or

$$ p_Y(2x_0) = p_X(x_0)\,\frac{1}{2}. $$
As expected, the PDF of $Y$ is scaled by 1/2. If we now let $y_0 = 2x_0$, then this becomes

$$ p_Y(y_0) = p_X(y_0/2)\,\frac{1}{2} $$

or for any arbitrary value of $y$

$$ p_Y(y) = p_X(y/2)\,\frac{1}{2} \quad 2 < y < 4. \tag{10.28} $$

This results in the final PDF using $p_X(x) = 1$ for $1 < x < 2$ as

$$ p_Y(y) = \begin{cases} \frac{1}{2} & 2 < y < 4 \\ 0 & \text{otherwise} \end{cases} \tag{10.29} $$

and thus if $X \sim \mathcal{U}(1, 2)$, then $Y = 2X \sim \mathcal{U}(2, 4)$. The general result for the PDF of $Y = g(X)$ is given by

$$ p_Y(y) = p_X(g^{-1}(y)) \left| \frac{dg^{-1}(y)}{dy} \right|. \tag{10.30} $$

For our example, the use of (10.30) with $g(x) = 2x$ and therefore $g^{-1}(y) = y/2$ results in (10.29). The absolute value is needed to allow for the case when $g$ is decreasing and hence $g^{-1}$ is decreasing since otherwise the scaling term would be negative (see Problem 10.57). A formal derivation is given in Appendix 10A. Note the similarity of (10.30) to (10.27). The principal difference is the presence of the derivative or Jacobian factor $dg^{-1}(y)/dy$. It is needed to account for the change in scaling due to the mapping of a given length interval into an interval of a different length as illustrated in Figure 10.22. Some examples of the use of (10.30) follow.
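The one-to-one transformation formula, the PDF of $X$ evaluated at $g^{-1}(y)$ times the magnitude of the derivative of $g^{-1}$, is easy to mechanize. The Python sketch below (the text itself uses MATLAB) approximates the derivative by a central difference. The function names and the step size `h` are assumptions made for illustration.

```python
def transformed_pdf(p_x, g_inverse, y, h=1e-6):
    """Evaluate p_X(g^{-1}(y)) * |d g^{-1}(y)/dy| for a one-to-one g.

    The derivative of the inverse is approximated by a central difference,
    and the absolute value handles a decreasing g.
    """
    slope = (g_inverse(y + h) - g_inverse(y - h)) / (2.0 * h)
    return p_x(g_inverse(y)) * abs(slope)

def uniform_1_2(x):
    """PDF of X ~ U(1,2); it is zero outside the sample space of X."""
    return 1.0 if 1.0 < x < 2.0 else 0.0

# Y = 2X: the inverse is y/2, so the PDF of Y is 1/2 on (2,4) and 0 elsewhere.
value_inside = transformed_pdf(uniform_1_2, lambda y: y / 2.0, 3.0)
value_outside = transformed_pdf(uniform_1_2, lambda y: y / 2.0, 5.0)
```

Because `uniform_1_2` returns zero outside $(1, 2)$, the result is automatically zero for $y$ outside $(2, 4)$. With a PDF written only as a formula, the sample space of $Y$ would have to be checked separately.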

Example 10.5 - PDF for linear (actually an affine) transformation

To determine the PDF of $Y = aX + b$, for $a$ and $b$ constants, first assume that $S_X = \{x : -\infty < x < \infty\}$ and hence $S_Y = \{y : -\infty < y < \infty\}$. Here we have $g(x) = ax + b$ so that the inverse function $g^{-1}$ is found by solving $y = ax + b$ for $x$. This yields $x = (y - b)/a$ so that

$$ g^{-1}(y) = \frac{y - b}{a} $$

and from (10.30) the general result is

$$ p_Y(y) = p_X\left(\frac{y - b}{a}\right) \left| \frac{1}{a} \right|. \tag{10.31} $$

As a further example, consider $X \sim \mathcal{N}(0, 1)$ and the transformation $Y = \sqrt{\sigma^2}\,X + \mu$. Then, letting $\sigma = \sqrt{\sigma^2} > 0$ we have

$$ \begin{aligned} p_Y(y) &= p_X\left(\frac{y - \mu}{\sigma}\right) \left| \frac{1}{\sigma} \right| \\ &= p_X\left(\frac{y - \mu}{\sigma}\right) \frac{1}{\sigma} \\ &= \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^2 \right] \frac{1}{\sigma} \\ &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} (y - \mu)^2 \right] \end{aligned} $$

and therefore $Y \sim \mathcal{N}(\mu, \sigma^2)$. A linear transformation of a Gaussian random variable results in another Gaussian random variable whose Gaussian PDF has different values of the parameters. Because of this property we can easily generate a realization of a $\mathcal{N}(\mu, \sigma^2)$ random variable using the MATLAB construction y=sqrt(sigma2)*randn(1,1)+mu, since randn(1,1) produces a realization of a standard normal random variable (see Problem 10.60).

❖
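The construction of a Gaussian realization by scaling and shifting a standard normal one can be checked by simulation. The Python sketch below (the text itself uses MATLAB) applies $y = \sqrt{\sigma^2}\,w + \mu$ to standard normal outcomes $w$ and then computes the sample mean and variance. The function names, the fixed seed, and the sample size are illustrative assumptions.

```python
import math
import random

def gaussian_realizations(mu, sigma2, count, seed=0):
    """Realizations of N(mu, sigma2) built from standard normal outcomes."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    scale = math.sqrt(sigma2)
    return [scale * rng.gauss(0.0, 1.0) + mu for _ in range(count)]

def sample_mean_and_variance(values):
    """Sample mean and (biased) sample variance of a list of outcomes."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance
```

With, say, $\mu = 1$ and $\sigma^2 = 4$, the sample mean and variance of a few thousand outcomes should land close to 1 and 4.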


Example 10.6 - PDF of $Y = \exp(X)$ for $X \sim \mathcal{N}(0, 1)$

Here we have that $S_Y = \{y : y > 0\}$. To find $g^{-1}(y)$ we let $y = \exp(x)$ and solve for $x$, which is $x = \ln(y)$. Thus, $g^{-1}(y) = \ln(y)$. From (10.30) it follows that

$$ p_Y(y) = p_X(\ln(y)) \left| \frac{d\ln(y)}{dy} \right| = \begin{cases} p_X(\ln(y))\,\frac{1}{y} & y > 0 \\ 0 & y \le 0 \end{cases} $$

or

$$ p_Y(y) = \begin{cases} \frac{1}{\sqrt{2\pi}\,y} \exp\left[ -\frac{1}{2} (\ln(y))^2 \right] & y > 0 \\ 0 & y \le 0. \end{cases} $$

This PDF is called the log-normal PDF. It is frequently used as a model for a quantity that is measured in decibels (dB) and which has a normal PDF in dB quantities [Members of Technical Staff 1970].

❖
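A quick numerical sanity check of the log-normal result is possible because $P[Y \le 1] = P[X \le 0] = 1/2$ when $Y = \exp(X)$ and $X$ is standard normal. The Python sketch below (the text itself uses MATLAB) codes the PDF, returning zero for $y \le 0$, and integrates it over $(0, 1)$ by the midpoint rule. The function names and the number of subintervals are illustrative assumptions.

```python
import math

def lognormal_pdf(y):
    """PDF of Y = exp(X) for X ~ N(0,1); zero outside the sample space y > 0."""
    if y <= 0.0:
        return 0.0
    return math.exp(-0.5 * math.log(y) ** 2) / (math.sqrt(2.0 * math.pi) * y)

def integrate_midpoint(f, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

prob_below_one = integrate_midpoint(lognormal_pdf, 0.0, 1.0)
```

The explicit test for `y <= 0.0` is the coded version of first determining the possible values of $Y$. Without it, `math.log` would raise an error for nonpositive arguments.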


Always determine the possible values for Y before using (10.30).

A common error in determining the PDF of a transformed random variable is to forget that $p_Y(y)$ may be zero over some regions. In the previous example of $y = \exp(x)$, the mapping of $-\infty < x < \infty$ is into $y > 0$. Hence, the PDF of $Y$ must be zero for $y \le 0$ since there are no values of $X$ that produce a zero or negative value of $Y$. Nonsensical results occur if we attempt to insert values in $p_Y(y)$ for $y \le 0$. To avoid this potential problem, we should first determine $S_Y$ and then use (10.30) to find the PDF over the sample space.

△

When the transformation is not one-to-one, we will have multiple solutions for $x$ in $y = g(x)$. An example is for $y = x^2$ for which the solutions are

$$ \begin{aligned} x_1 &= -\sqrt{y} = g_1^{-1}(y) \\ x_2 &= +\sqrt{y} = g_2^{-1}(y). \end{aligned} $$

This is shown in Figure 10.23.

[Figure 10.23: $g(x) = x^2$.]

In this case we use (10.30) but must add the PDFs (since both the x-intervals map into the same y-interval and the x-intervals are disjoint) to yield

$$ p_Y(y) = p_X(g_1^{-1}(y)) \left| \frac{dg_1^{-1}(y)}{dy} \right| + p_X(g_2^{-1}(y)) \left| \frac{dg_2^{-1}(y)}{dy} \right|. \tag{10.32} $$


Example 10.7 - PDF of $Y = X^2$ for $X \sim \mathcal{N}(0, 1)$

Since $-\infty < X < \infty$, we must have $Y \ge 0$. Next because $g_1^{-1}(y) = -\sqrt{y}$ and $g_2^{-1}(y) = \sqrt{y}$ we have from (10.32)

$$ p_Y(y) = \begin{cases} p_X(-\sqrt{y}) \left| -\frac{1}{2\sqrt{y}} \right| + p_X(\sqrt{y}) \left| \frac{1}{2\sqrt{y}} \right| & y > 0 \\ 0 & y \le 0 \end{cases} $$

which reduces to

$$ p_Y(y) = \begin{cases} \left[ \frac{1}{\sqrt{2\pi}} \exp(-y/2) + \frac{1}{\sqrt{2\pi}} \exp(-y/2) \right] \frac{1}{2\sqrt{y}} & y > 0 \\ 0 & y \le 0 \end{cases} = \begin{cases} \frac{1}{\sqrt{2\pi y}} \exp(-y/2) & y > 0 \\ 0 & y \le 0. \end{cases} $$

This is shown in Figure 10.24 and should be compared to Figure 2.10 in which this PDF was estimated (see also Problem 10.59). Note that the PDF is undefined at $y = 0$ since $p_Y(0) \to \infty$, although the total area under the PDF is finite and of course is equal to 1. Also, $Y \sim \chi_1^2$ as can be seen by referring to (10.12) with $N = 1$.

[Figure 10.24: PDF for $Y = X^2$ for $X \sim \mathcal{N}(0, 1)$.]

❖

In general, if $y = g(x)$ has solutions $x_i = g_i^{-1}(y)$ for $i = 1, 2, \ldots, M$, then

$$ p_Y(y) = \sum_{i=1}^{M} p_X(g_i^{-1}(y)) \left| \frac{dg_i^{-1}(y)}{dy} \right|. \tag{10.33} $$
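The sum over solution branches translates directly into code. The Python sketch below (the text itself uses MATLAB) takes a list of (inverse, derivative-of-inverse) pairs and adds up the branch contributions. It is then applied to $y = x^2$ with a standard normal input. The function names and the representation of branches as pairs of functions are assumptions made for illustration.

```python
import math

def std_normal_pdf(x):
    """PDF of a standard normal random variable."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def many_to_one_pdf(p_x, branches, y):
    """Sum p_X(g_i^{-1}(y)) * |d g_i^{-1}(y)/dy| over all solution branches.

    branches is a list of (inverse, derivative_of_inverse) function pairs.
    """
    return sum(p_x(inv(y)) * abs(dinv(y)) for inv, dinv in branches)

# The two solutions of y = x^2 and the derivatives of the inverses.
square_branches = [
    (lambda y: -math.sqrt(y), lambda y: -0.5 / math.sqrt(y)),
    (lambda y: math.sqrt(y), lambda y: 0.5 / math.sqrt(y)),
]

def pdf_of_square(y):
    """PDF of Y = X^2 for X ~ N(0,1); zero for y <= 0."""
    if y <= 0.0:
        return 0.0
    return many_to_one_pdf(std_normal_pdf, square_branches, y)
```

For $y > 0$ the value returned by `pdf_of_square` matches the closed form $\exp(-y/2)/\sqrt{2\pi y}$ of Example 10.7.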


An  alternative  means  of  finding  the  PDF  of  a  transformed  random  variable  is  to  first 
find  the  CDF  and  then  differentiate  it  (see  (10.26)).  We  illustrate  this  approach  by 
redoing  the  previous  example. 
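Example 10.8 - CDF approach to determine PDF of $Y = X^2$ for $X \sim \mathcal{N}(0, 1)$

First we determine the CDF of $Y$ in terms of the CDF for $X$ as

$$ \begin{aligned} F_Y(y) &= P[Y \le y] \\ &= P[X^2 \le y] \\ &= P[-\sqrt{y} \le X \le \sqrt{y}] \\ &= F_X(\sqrt{y}) - F_X(-\sqrt{y}) \quad \text{(from (10.25))}. \end{aligned} $$

Then, differentiating we have

$$ \begin{aligned} p_Y(y) &= \frac{dF_Y(y)}{dy} \\ &= \frac{d}{dy} \left[ F_X(\sqrt{y}) - F_X(-\sqrt{y}) \right] \\ &= p_X(\sqrt{y}) \frac{d\sqrt{y}}{dy} - p_X(-\sqrt{y}) \frac{d(-\sqrt{y})}{dy} \quad \text{(from (10.26) and chain rule of calculus)} \\ &= p_X(\sqrt{y}) \frac{1}{2\sqrt{y}} + p_X(-\sqrt{y}) \frac{1}{2\sqrt{y}} \\ &= \begin{cases} p_X(\sqrt{y}) \frac{1}{\sqrt{y}} & y > 0 \\ 0 & y \le 0 \end{cases} \quad \text{(since } p_X(-x) = p_X(x) \text{ for } X \sim \mathcal{N}(0, 1)) \\ &= \begin{cases} \frac{1}{\sqrt{2\pi y}} \exp(-y/2) & y > 0 \\ 0 & y \le 0. \end{cases} \end{aligned} $$

❖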
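The CDF route of Example 10.8 can also be followed numerically: form $F_Y(y) = F_X(\sqrt{y}) - F_X(-\sqrt{y})$ and difference it over a small interval. The Python sketch below (the text itself uses MATLAB) does this, with the standard normal CDF written in terms of the error function. The function names and the interval width `dy` are illustrative assumptions.

```python
import math

def std_normal_cdf(x):
    """CDF of a standard normal random variable via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cdf_of_square(y):
    """F_Y(y) = F_X(sqrt(y)) - F_X(-sqrt(y)) for y > 0, and 0 otherwise."""
    if y <= 0.0:
        return 0.0
    root = math.sqrt(y)
    return std_normal_cdf(root) - std_normal_cdf(-root)

def pdf_by_differencing(cdf, y, dy=1e-5):
    """Approximate dF/dy by [F(y + dy/2) - F(y - dy/2)] / dy."""
    return (cdf(y + dy / 2.0) - cdf(y - dy / 2.0)) / dy
```

Away from $y = 0$, `pdf_by_differencing(cdf_of_square, y)` reproduces $\exp(-y/2)/\sqrt{2\pi y}$ to several decimal places. Near $y = 0$ the difference quotient is unreliable because the PDF is unbounded there.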


10.8  Mixed  Random  Variables 

We have so far described two types of random variables, the discrete random variable and the continuous random variable. The sample space for a discrete random variable consists of a countable (either finite or infinite) set of points while that for a continuous random variable has an infinite and uncountable set of points. The points in $S_X$ for a discrete random variable have a nonzero probability while those for a continuous random variable have a zero probability. In some physical situations, however, we wish to assign a nonzero probability to some points but not others. As an example, consider an experiment in which a fair coin is tossed. If it comes up heads, we generate the outcome of a continuous random variable $X \sim \mathcal{N}(0, 1)$ and if it comes up tails we set $X = 0$. Then, the possible outcomes are $-\infty < x < \infty$ and any point except $x = 0$ has a zero probability of occurring. However, the point $x = 0$ occurs with a probability of 1/2 since the probability of a tail is 1/2. A typical sequence of outcomes is shown in Figure 10.25.

[Figure 10.25: Sequence of outcomes for mixed random variable - $X = 0$ with nonzero probability.]

One could define a random variable as

$$ \begin{aligned} X &\sim \mathcal{N}(0, 1) & \text{if heads} \\ X &= 0 & \text{if tails} \end{aligned} $$

which is neither a discrete nor a continuous random variable. To find its CDF we use the law of total probability to yield

$$ \begin{aligned} F_X(x) &= P[X \le x] \\ &= P[X \le x \,|\, \text{heads}] P[\text{heads}] + P[X \le x \,|\, \text{tails}] P[\text{tails}] \\ &= \begin{cases} \Phi(x)\frac{1}{2} + 0\left(\frac{1}{2}\right) & x < 0 \\ \Phi(x)\frac{1}{2} + 1\left(\frac{1}{2}\right) & x \ge 0 \end{cases} \end{aligned} $$

which can be written more succinctly using the unit step function. The unit step function is defined as $u(x) = 1$ for $x \ge 0$ and $u(x) = 0$ for $x < 0$. With this definition the CDF becomes

$$ F_X(x) = \frac{1}{2}\Phi(x) + \frac{1}{2}u(x) \quad -\infty < x < \infty. $$

The CDF is shown in Figure 10.26. Note the jump at $x = 0$, indicative of the contribution of the discrete part of the random variable. The CDF is continuous for all $x \ne 0$ but has a jump at $x = 0$ of 1/2. It corresponds to neither a discrete random variable, whose CDF consists only of jumps, nor a continuous random variable, whose CDF is continuous everywhere. Hence, it is called a mixed random variable. Its CDF is in general continuous except for a countable number of jumps (either finite or infinite). As usual it is right-continuous at the jump.

[Figure 10.26: CDF for mixed random variable.]
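The coin-toss experiment is simple to simulate, which gives a direct check on the CDF of the mixed random variable, $\frac{1}{2}\Phi(x) + \frac{1}{2}u(x)$. The Python sketch below (the text itself uses MATLAB) generates outcomes and compares the fraction of outcomes not exceeding $x$ with this CDF. The function names, the fixed seed, and the sample size are illustrative assumptions.

```python
import math
import random

def mixed_cdf(x):
    """CDF of the mixed random variable: (1/2)*Phi(x) + (1/2)*u(x)."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    step = 1.0 if x >= 0.0 else 0.0
    return 0.5 * phi + 0.5 * step

def mixed_realizations(count, seed=0):
    """Heads: draw from N(0,1). Tails: the outcome is exactly 0."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) if rng.random() < 0.5 else 0.0
            for _ in range(count)]

def empirical_cdf(values, x):
    """Fraction of outcomes that are less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)
```

Evaluating `mixed_cdf` just below and at $x = 0$ shows the jump of 1/2, and about half of the simulated outcomes are exactly zero.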

Strictly speaking, a mixed random variable does not have a PMF or a PDF. However, by the use of the Dirac delta function (also called an impulse), we can define a PDF which may then be used to find the probability of an interval via integration by using (10.4). To first find the PDF we attempt to differentiate the CDF

$$ p_X(x) = \frac{d}{dx} \left[ \frac{1}{2}\Phi(x) + \frac{1}{2}u(x) \right]. $$

The difficulty encountered is that $u(x)$ is discontinuous at $x = 0$ and thus formally its derivative does not exist there. We can, however, define a derivative for the purposes of probability calculations as well as for conceptualization. To do so requires the introduction of the Dirac delta function $\delta(x)$ which is defined as (see also Appendix D)

$$ \delta(x) = \frac{du(x)}{dx}. $$

The function $\delta(x)$ is usually thought of as a very narrow pulse with a very large amplitude which is centered at $x = 0$. It has the property that $\delta(t) = 0$ for all $t \ne 0$ but

$$ \int_{-\epsilon}^{\epsilon} \delta(t)\,dt = 1 $$

for $\epsilon$ a small positive number. Hence, the area under the narrow pulse is one. Using this definition we can now differentiate the CDF to find that

$$ p_X(x) = \frac{1}{2} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right) + \frac{1}{2}\delta(x) \tag{10.34} $$
which is shown in Figure 10.27. This may be thought of as a generalized PDF.

[Figure 10.27: PDF for mixed random variable.]

Note that it is the strength, which is defined as the area under the approximating narrow pulse, that is equal to 1/2. The amplitude is theoretically infinite. The CDF can be recovered using (10.16) and the result that

$$ u(x) = \int_{-\infty}^{x^+} \delta(t)\,dt $$

where $x^+$ means that the integration interval is $(-\infty, x + \epsilon]$ for $\epsilon$ a small positive number. Thus, the impulse should be included in the integration interval if $x = 0$ so that $u(0) = 1$ according to the definition of the unit step function.


When do we include the impulse in the integration interval?

For a mixed random variable the presence of impulses in the PDF requires a modification to (10.4). This is because an endpoint of the interval can have a nonzero probability. As a result, the probabilities $P[0 < X < 1]$ and $P[0 \le X < 1]$ will be different if there is an impulse at $x = 0$. Specifically, consider the computation of $P[0 \le X \le 1]$ and note that the probability of $X = 0$ should be included. Therefore, if there is an impulse at $x = 0$, the area under the PDF should include the contribution of the impulse. Thus, the integration interval should be chosen as $[0^-, 1]$ so that

$$ P[0 \le X \le 1] = \int_{0^-}^{1} p_X(x)\,dx. $$

The more general modifications to (10.4) are

$$ \begin{aligned} P[a < X < b] &= \int_{a^+}^{b^-} p_X(x)\,dx \\ P[a < X \le b] &= \int_{a^+}^{b^+} p_X(x)\,dx \\ P[a \le X < b] &= \int_{a^-}^{b^-} p_X(x)\,dx \\ P[a \le X \le b] &= \int_{a^-}^{b^+} p_X(x)\,dx \end{aligned} $$

where $x^-$ is a number slightly less than $x$ and $x^+$ is a number slightly greater than $x$. Of course, if the PDF does not have any impulses at $x = a$ or $x = b$, then all the integrals above will be the same and therefore there is no need to choose between them. See also Problem 10.51.

△

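Continuing with our example, let's say we wish to determine $P[-2 < X < 2]$. Then, using (10.4) since the impulse does not occur at one of the interval endpoints, and our generalized PDF of (10.34) yields

$$ \begin{aligned} P[-2 < X < 2] &= \int_{-2}^{2} p_X(x)\,dx \\ &= \int_{-2}^{2} \left[ \frac{1}{2} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right) + \frac{1}{2}\delta(x) \right] dx \\ &= \frac{1}{2} \int_{-2}^{2} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right) dx + \frac{1}{2} \int_{-2}^{2} \delta(x)\,dx \\ &= \frac{1}{2} [Q(-2) - Q(2)] + \frac{1}{2} \\ &= \frac{1}{2} [1 - 2Q(2)] + \frac{1}{2} = 1 - Q(2). \end{aligned} $$

Alternatively, we could have obtained this result using $P[-2 < X < 2] = F_X(2) - F_X(-2)$ with $F_X(x) = (1/2)(1 - Q(x)) + (1/2)u(x)$.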
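The CDF route to this probability takes only a few lines. The Python sketch below (the text itself uses MATLAB) codes $F_X(x) = (1/2)(1 - Q(x)) + (1/2)u(x)$ and forms $F_X(2) - F_X(-2)$. The function names are illustrative assumptions.

```python
import math

def q_function(x):
    """Q(x) via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def unit_step(x):
    """u(x) = 1 for x >= 0 and 0 for x < 0."""
    return 1.0 if x >= 0.0 else 0.0

def mixed_cdf(x):
    """F_X(x) = (1/2)(1 - Q(x)) + (1/2)u(x)."""
    return 0.5 * (1.0 - q_function(x)) + 0.5 * unit_step(x)

def interval_probability(a, b):
    """F_X(b) - F_X(a), which is P[a < X <= b] for any random variable."""
    return mixed_cdf(b) - mixed_cdf(a)
```

Note that `interval_probability(0.0, 1.0)` excludes the impulse at $x = 0$, since the CDF difference corresponds to the half-open interval $(a, b]$. This is the endpoint issue raised in the boxed warning above.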

Mixed  random  variables  often  arise  as  a  result  of  a  transformation  of  a  continuous 
random  variable.  A  final  example  follows. 

Example 10.9 - PDF for amplitude-limited Rayleigh random variable

Consider a Rayleigh random variable whose PDF is given by (10.14) that is input to a device that limits its output. One might envision a physical quantity such as temperature and the device being a thermometer which can only read temperatures up to a maximum value. All temperatures above this maximum value are read as the maximum. Then the effect of the device can be represented by the transformation

$$ y = g(x) = \begin{cases} x & 0 \le x < x_{\max} \\ x_{\max} & x \ge x_{\max} \end{cases} $$

which is shown in Figure 10.28.

[Figure 10.28: $y = g(x)$.]

The PDF of $Y$ is zero for $y < 0$ since $X$ can only take on nonnegative values. For $0 \le y < x_{\max}$ it is seen from Figure 10.28 that $g^{-1}(y) = y$. Finally, for $y \ge x_{\max}$ we have from Figure 10.28 the infinite number of solutions $x \in [x_{\max}, \infty)$. Thus, we have for region 1 or for $y < 0$ that $p_Y(y) = 0$. For region 2 or for $0 \le y < x_{\max}$ where $g^{-1}(y) = y$, we have from (10.30)

$$ \begin{aligned} p_Y(y) &= p_X(g^{-1}(y)) \left| \frac{dg^{-1}(y)}{dy} \right| \\ &= p_X(y). \end{aligned} $$

For region 3 which is $y \ge x_{\max}$, we note that $Y$ cannot exceed $x_{\max}$ and so $y = x_{\max}$ is the only possible value for $y$ in region 3. The probability of $Y = x_{\max}$ is equal to the probability that $X \ge x_{\max}$. In particular, it is

$$ P[Y = x_{\max}] = \int_{x_{\max}}^{\infty} p_X(x)\,dx \tag{10.35} $$

since from Figure 10.28 the x-interval $[x_{\max}, \infty)$ is mapped into the y-point given by $y = x_{\max}$. Since the probability of $Y$ at the point $y = x_{\max}$ is nonzero, we represent its contribution to the PDF by using an impulse as

$$ p_Y(y) = \left[ \int_{x_{\max}}^{\infty} p_X(x)\,dx \right] \delta(y - x_{\max}) \quad y = x_{\max}. $$
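In summary, the PDF of the transformed random variable is

$$ p_Y(y) = \begin{cases} 0 & y < 0 \\ p_X(y) & 0 \le y < x_{\max} \\ \left[ \int_{x_{\max}}^{\infty} p_X(x)\,dx \right] \delta(y - x_{\max}) & y = x_{\max} \\ 0 & y > x_{\max}. \end{cases} $$

It is seen to be the PDF of a mixed random variable in that it contains an impulse. Finally, for $x \ge 0$ the Rayleigh PDF is for $\sigma^2 = 1$

$$ p_X(x) = x \exp\left(-\frac{1}{2}x^2\right) $$

so that the PDF of $Y$ becomes

$$ \begin{aligned} p_Y(y) &= \begin{cases} 0 & y < 0 \\ y \exp\left(-\frac{1}{2}y^2\right) & 0 \le y < x_{\max} \\ \left[ \int_{x_{\max}}^{\infty} x \exp\left(-\frac{1}{2}x^2\right) dx \right] \delta(y - x_{\max}) & y = x_{\max} \\ 0 & y > x_{\max} \end{cases} \\ &= \begin{cases} 0 & y < 0 \\ y \exp\left(-\frac{1}{2}y^2\right) & 0 \le y < x_{\max} \\ \exp\left(-\frac{1}{2}x_{\max}^2\right) \delta(y - x_{\max}) & y = x_{\max} \\ 0 & y > x_{\max}. \end{cases} \end{aligned} $$

This is plotted in Figure 10.29b.

[Figure 10.29: PDFs before and after transformation of Figure 10.28.]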
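The strength of the impulse, $\exp(-x_{\max}^2/2)$ for $\sigma^2 = 1$, is the fraction of limiter outputs that sit exactly at $x_{\max}$, so it can be checked by simulation. The Python sketch below (the text itself uses MATLAB) generates Rayleigh outcomes by inverting the CDF $1 - \exp(-x^2/2)$ and then applies the limiter. The function names, the fixed seed, and the use of the inverse-CDF generator are illustrative assumptions.

```python
import math
import random

def rayleigh_realization(rng):
    """Rayleigh outcome (sigma^2 = 1) from inverting F(x) = 1 - exp(-x^2/2)."""
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
    return math.sqrt(-2.0 * math.log(1.0 - rng.random()))

def limited_realizations(x_max, count, seed=0):
    """Outputs of the limiter y = min(x, x_max) for Rayleigh inputs."""
    rng = random.Random(seed)
    return [min(rayleigh_realization(rng), x_max) for _ in range(count)]

def impulse_strength(x_max):
    """P[Y = x_max] = exp(-x_max^2 / 2) for the sigma^2 = 1 Rayleigh PDF."""
    return math.exp(-0.5 * x_max ** 2)
```

For example, with `x_max = 1.5` roughly a third of the outputs are exactly 1.5, in line with `impulse_strength(1.5)`.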
In general, if a random variable $X$ can take on a continuum of values as well as discrete values $\{x_1, x_2, \ldots\}$ with corresponding nonzero probabilities $\{p_1, p_2, \ldots\}$, then the PDF of the mixed random variable $X$ can be written in the succinct form

$$ p_X(x) = p_c(x) + \sum_{i=1}^{\infty} p_i \delta(x - x_i) \tag{10.36} $$

where $p_c(x)$ represents the contribution to the PDF of the continuous part (its integral must be $< 1$) and must satisfy $p_c(x) \ge 0$. To be a valid PDF we require that

$$ \int_{-\infty}^{\infty} p_c(x)\,dx + \sum_{i=1}^{\infty} p_i = 1. $$

For solely discrete random variables we can use the generalized PDF

$$ p_X(x) = \sum_{i=1}^{\infty} p_i \delta(x - x_i) $$

or equivalently the PMF

$$ p_X[x_i] = p_i \quad i = 1, 2, \ldots $$

to perform probability calculations.


10.9  Computer  Simulation 


In simulating the outcome of a discrete random variable $X$ we saw in Figure 5.14 that first an outcome of a $U \sim \mathcal{U}(0, 1)$ random variable is generated and then mapped into a value of $X$. The mapping needed was the inverse of the CDF. This result is also valid for a continuous random variable so that $X = F_X^{-1}(U)$ is a random variable with CDF $F_X(x)$. Stated another way, we have that $U = F_X(X)$ or if a random variable is transformed according to its CDF, the transformed random variable $U \sim \mathcal{U}(0, 1)$. This latter transformation is termed the probability integral transformation. The transformation $X = F_X^{-1}(U)$ is called the inverse probability integral transformation. Before proving these results we give an example.

Example 10.10 - Probability integral transformation of exponential random variable

Since the exponential PDF is given for $\lambda = 1$ by

$$ p_X(x) = \begin{cases} \exp(-x) & x \ge 0 \\ 0 & x < 0 \end{cases} $$

the CDF is from (10.16)

$$ F_X(x) = \begin{cases} 0 & x < 0 \\ 1 - \exp(-x) & x \ge 0. \end{cases} $$

The probability integral transformation asserts that $Y = g(X) = F_X(X)$ has a $\mathcal{U}(0, 1)$ PDF. Considering the transformation $g(x) = 1 - \exp(-x)$ for $x \ge 0$ and zero otherwise, we have that $y = 1 - \exp(-x)$ and therefore the unique solution for $x$ is $x = -\ln(1 - y)$ for $0 < y < 1$ and zero otherwise. Hence,

$$ g^{-1}(y) = \begin{cases} -\ln(1 - y) & 0 < y < 1 \\ 0 & \text{otherwise} \end{cases} $$

and using (10.30), we have for $0 < y < 1$

$$ \begin{aligned} p_Y(y) &= p_X(g^{-1}(y)) \left| \frac{dg^{-1}(y)}{dy} \right| \\ &= \exp\left[ -(-\ln(1 - y)) \right] \left| \frac{1}{1 - y} \right| \\ &= 1. \end{aligned} $$

Finally, then

$$ p_Y(y) = \begin{cases} 1 & 0 < y < 1 \\ 0 & \text{otherwise} \end{cases} $$

which is the PDF of a $\mathcal{U}(0, 1)$ random variable.
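The probability integral transformation of Example 10.10 can be checked by simulation: pass exponential outcomes through their own CDF and see whether the results behave like $\mathcal{U}(0, 1)$ outcomes. The Python sketch below (the text itself uses MATLAB) does this. The function names, the fixed seed, and the use of `random.expovariate` as the source of exponential outcomes are illustrative assumptions.

```python
import math
import random

def exponential_cdf(x):
    """CDF of the exponential random variable with lambda = 1."""
    return 1.0 - math.exp(-x) if x >= 0.0 else 0.0

def transformed_outcomes(count, seed=0):
    """Outcomes of U = F_X(X), where X is exponential with lambda = 1."""
    rng = random.Random(seed)
    return [exponential_cdf(rng.expovariate(1.0)) for _ in range(count)]

def fraction_at_most(values, level):
    """Fraction of outcomes that are less than or equal to level."""
    return sum(1 for v in values if v <= level) / len(values)
```

If the transformed outcomes are uniform on $(0, 1)$, the fraction at or below any level $c$ in $(0, 1)$ should be close to $c$, and the sample mean should be close to 1/2.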


To summarize our results we have the following theorem.

Theorem 10.9.1 (Inverse Probability Integral Transformation) If a continuous random variable $X$ is given as $X = F_X^{-1}(U)$, where $U \sim \mathcal{U}(0, 1)$, then $X$ has the PDF $p_X(x) = dF_X(x)/dx$.

Proof:

Let $V = F_X^{-1}(U)$ and consider the CDF of $V$.

$$ \begin{aligned} F_V(v) &= P[V \le v] = P[F_X^{-1}(U) \le v] \\ &= P[U \le F_X(v)] \quad (F_X \text{ is monotonically increasing - see Problem 10.58}) \\ &= \int_0^{F_X(v)} p_U(u)\,du \\ &= \int_0^{F_X(v)} 1\,du \\ &= F_X(v). \end{aligned} $$

Hence, the CDFs of $V$ and $X$ are equal and therefore the PDF of $V = F_X^{-1}(U)$ is $p_X(x)$.

△

Another example follows.
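Example 10.11 - Computer generation of outcome of Laplacian random variable

The Laplacian random variable has a PDF

$$ p_X(x) = \frac{1}{\sqrt{2\sigma^2}} \exp\left( -\sqrt{\frac{2}{\sigma^2}}\,|x| \right) \quad -\infty < x < \infty $$

and therefore its CDF is found as

$$ F_X(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\sigma^2}} \exp\left( -\sqrt{\frac{2}{\sigma^2}}\,|t| \right) dt. $$

For $x < 0$ we have

$$ \begin{aligned} F_X(x) &= \int_{-\infty}^{x} \frac{1}{\sqrt{2\sigma^2}} \exp\left( \sqrt{\frac{2}{\sigma^2}}\,t \right) dt \\ &= \left. \frac{1}{2} \exp\left( \sqrt{\frac{2}{\sigma^2}}\,t \right) \right|_{-\infty}^{x} \\ &= \frac{1}{2} \exp\left( \sqrt{\frac{2}{\sigma^2}}\,x \right) \end{aligned} $$

and for $x \ge 0$ we have

$$ \begin{aligned} F_X(x) &= \int_{-\infty}^{0} \frac{1}{\sqrt{2\sigma^2}} \exp\left( \sqrt{\frac{2}{\sigma^2}}\,t \right) dt + \int_{0}^{x} \frac{1}{\sqrt{2\sigma^2}} \exp\left( -\sqrt{\frac{2}{\sigma^2}}\,t \right) dt \\ &= \frac{1}{2} - \left. \frac{1}{2} \exp\left( -\sqrt{\frac{2}{\sigma^2}}\,t \right) \right|_{0}^{x} \quad \text{(first integral is 1/2 since } p_X(-x) = p_X(x)) \\ &= 1 - \frac{1}{2} \exp\left( -\sqrt{\frac{2}{\sigma^2}}\,x \right). \end{aligned} $$

By letting $y = F_X(x)$, we have

$$ y = \begin{cases} \frac{1}{2} \exp\left( \sqrt{\frac{2}{\sigma^2}}\,x \right) & x < 0 \\ 1 - \frac{1}{2} \exp\left( -\sqrt{\frac{2}{\sigma^2}}\,x \right) & x \ge 0. \end{cases} $$

We note that if $x < 0$, then $0 < y < 1/2$, and if $x \ge 0$, then $1/2 \le y < 1$. Thus, solving for $x$ to produce $F_X^{-1}(y)$ yields

$$ x = \begin{cases} \sqrt{\sigma^2/2}\,\ln(2y) & 0 < y < 1/2 \\ \sqrt{\sigma^2/2}\,\ln\left( \frac{1}{2(1 - y)} \right) & 1/2 \le y < 1. \end{cases} $$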
Finally to generate the outcome of a Laplacian random variable we can use

$$ x = \begin{cases} \sqrt{\sigma^2/2}\,\ln(2u) & 0 < u < 1/2 \\ \sqrt{\sigma^2/2}\,\ln\left( \frac{1}{2(1 - u)} \right) & 1/2 \le u < 1 \end{cases} \tag{10.37} $$

where $u$ is a realization of a $\mathcal{U}(0, 1)$ random variable. An example of the outcomes of a Laplacian random variable with $\sigma^2 = 1$ is shown in Figure 10.30a. In Figure 10.30b the true PDF (the solid curve) along with the estimated PDF (the bar plot) is shown based on $M = 1000$ outcomes. The estimate of the PDF was accomplished by the procedure described in Example 2.1 (see Figure 2.7 for the code for a Gaussian PDF).

[Figure 10.30: Computer generation of Laplacian random variable outcomes using inverse probability integral transformation. (a) First 50 outcomes. (b) True PDF and estimated PDF based on 1000 outcomes.]

We can now justify that procedure. Since from Section 10.3 we have

$$ p_X(x_0) \approx \frac{P[x_0 - \Delta x/2 \le X \le x_0 + \Delta x/2]}{\Delta x} $$

and

$$ P[x_0 - \Delta x/2 \le X \le x_0 + \Delta x/2] \approx \frac{\text{Number of outcomes in } [x_0 - \Delta x/2, x_0 + \Delta x/2]}{M} $$

we use as our PDF estimator

$$ \hat{p}_X(x_0) = \frac{\text{Number of outcomes in } [x_0 - \Delta x/2, x_0 + \Delta x/2]}{M \Delta x}. \tag{10.38} $$
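The inverse probability integral transformation for the Laplacian random variable and the bin-counting PDF estimator can be sketched together. The Python version below (the text itself uses MATLAB) is only an illustration: the function names, the default $\sigma^2 = 1$, the fixed seed, and the guard against a uniform outcome of exactly zero are all assumptions.

```python
import math
import random

def laplacian_inverse_cdf(u, sigma2=1.0):
    """Map u in (0,1) to a Laplacian outcome by inverting the CDF."""
    scale = math.sqrt(sigma2 / 2.0)
    if u < 0.5:
        return scale * math.log(2.0 * u)
    return scale * math.log(1.0 / (2.0 * (1.0 - u)))

def laplacian_realizations(count, sigma2=1.0, seed=0):
    """Generate Laplacian outcomes from U(0,1) outcomes."""
    rng = random.Random(seed)
    outcomes = []
    while len(outcomes) < count:
        u = rng.random()
        if u > 0.0:  # random() can return 0.0, where ln(2u) is undefined
            outcomes.append(laplacian_inverse_cdf(u, sigma2))
    return outcomes

def pdf_estimate(outcomes, center, binwidth):
    """Count outcomes in [center - binwidth/2, center + binwidth/2],
    then divide by (number of outcomes) * binwidth."""
    half = binwidth / 2.0
    count = sum(1 for x in outcomes if center - half <= x <= center + half)
    return count / (len(outcomes) * binwidth)
```

With a binwidth of 0.5 the estimate at $x_0 = 0$ settles near the average of the PDF over the bin, which is noticeably below the peak value $1/\sqrt{2\sigma^2}$. This is one reason a smaller binwidth, together with more outcomes, gives a more faithful picture near the peak.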


In Figure 10.30b we have chosen the bins or intervals to be $[-4.25, -3.75], [-3.75, -3.25], \ldots, [3.75, 4.25]$ so that $\Delta x = 0.5$. We have therefore estimated $p_X(-4), p_X(-3.5), \ldots, p_X(4)$. To estimate the PDF at more points we would have to decrease the binwidth or $\Delta x$. However, in doing so we cannot make it too small. This is because as the binwidth decreases, the probability of an outcome falling within the bin also decreases. As a result, fewer of the outcomes will occur within each bin, resulting in a poor estimate. The only way to remedy this situation is to increase the number of trials $M$. What do you suppose would happen if we wanted to estimate $p_X(5)$? The MATLAB code for producing the PDF estimate is given below.

    % Assume outcomes are in x, which is M x 1 vector
    M=1000;
    bincenters=[-4:0.5:4]';  % set binwidth = 0.5
    bins=length(bincenters);
    h=zeros(bins,1);
    for i=1:length(x)  % count outcomes in each bin
       for k=1:bins
          if x(i)>bincenters(k)-0.5/2 ...
                & x(i)<bincenters(k)+0.5/2
             h(k,1)=h(k,1)+1;
          end
       end
    end
    pxest=h/(M*0.5);  % see (10.38)

The CDF can be estimated by using

$$ \hat{F}_X(x) = \frac{\text{Number of outcomes} \le x}{M} \tag{10.39} $$

and is the same for either a discrete or a continuous random variable. See also Problems 10.60-62.

❖

10.10 Real-World Example - Setting Clipping Levels for Speech Signals

In order to communicate speech over a transmission channel it is important to make sure that the equipment does not "clip" the speech signal. Commercial broadcast stations commonly use VU meters to monitor the power of the speech. If the power becomes too large, then the amplifier gains are manually decreased. Clipped speech sounds distorted and is objectionable. In other situations, the amplifier gains must be set automatically, as for example, in telephone speech transmission. This is necessary so that the speech, if transmitted in an analog form, is not distorted at the receiver, and if transmitted in a digital form is not clipped by an analog-to-digital convertor. To determine the highest amplitude of the speech signal that can be expected to occur a common model is to use a Laplacian PDF for the amplitudes [Rabiner and Schafer 1978]. Hence, most of the amplitudes are near zero but larger level ones are possible according to

$$ p_X(x) = \frac{1}{\sqrt{2\sigma^2}} \exp\left( -\sqrt{\frac{2}{\sigma^2}}\,|x| \right) \quad -\infty < x < \infty. $$


As seen in Figure 10.10, the width of the PDF increases as $\sigma^2$ increases. In effect,
$\sigma^2$ measures the width of the PDF and is actually its variance (to be shown in
Problem 11.34). The parameter $\sigma^2$ is also a measure of the speech power. In order
to avoid excessive clipping we must be sure that an amplifier can accommodate a
high level, even if it occurs rather infrequently. A design requirement might then be
to transmit a speech signal without clipping 99% of the time. A model for a clipper
is shown in Figure 10.31. As long as the input signal, i.e., $x$, remains in the interval
$-A \le x \le A$, the output will be the same as the input and no clipping takes place.
However, if $x > A$, the output will be limited to $A$ and similarly if $x < -A$. Clipping
will then occur whenever $|x| > A$.

Figure 10.31: Clipper input-output characteristics.

To satisfy the design requirement that clipping
should not occur for 99% of the time, we should choose $A$ (which is a characteristic
of the amplifier or analog-to-digital convertor) so that $P_{\rm clip} \le 0.01$. But

$$P_{\rm clip} = P[X > A \text{ or } X < -A]$$

and since the Laplacian PDF is symmetric about $x = 0$ this is just

$$P_{\rm clip} = 2\int_A^{\infty} \frac{1}{\sqrt{2\sigma^2}} \exp\left(-\sqrt{\frac{2}{\sigma^2}}\,x\right) dx = \exp\left(-\sqrt{\frac{2}{\sigma^2}}\,A\right). \qquad (10.40)$$

Hence, if this probability is to be no more than 0.01, we must have

$$\exp\left(-\sqrt{\frac{2}{\sigma^2}}\,A\right) \le 0.01$$

or solving for $A$ produces the requirement that

$$A \ge \sqrt{\frac{\sigma^2}{2}}\,\ln\left(\frac{1}{0.01}\right). \qquad (10.41)$$


It is seen that as the speech power $\sigma^2$ increases, so must the clipping level $A$. If the
clipping level is fixed, then speech with higher powers will be clipped more often. As
an example, consider a speech signal with $\sigma^2 = 1$. The Laplacian model outcomes
are shown in Figure 10.32 along with a clipping level of $A = 1$.

Figure 10.32: Outcomes of Laplacian random variable with $\sigma^2 = 1$ - model for
speech amplitudes.

According to (10.40)
the probability of clipping is $\exp(-\sqrt{2}) = 0.2431$. Since there are 50 outcomes in
Figure 10.32 we would expect about $50 \cdot 0.2431 \approx 12$ instances of clipping. From the
figure we see that there are exactly 12. To meet the specification we should have
that

$$A \ge \sqrt{\frac{1}{2}}\,\ln\left(\frac{1}{0.01}\right) \approx 3.25.$$

As seen from Figure 10.32 there are no instances of clipping for $A = 3.25$. In order
to set the appropriate clipping level $A$, we need to know $\sigma^2$. In practice, this too
must be estimated since different speakers have different volumes and even the same
speaker will exhibit a different volume over time!
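
The design rule (10.41) and the clipping probability (10.40) can be checked with a short simulation. The MATLAB fragment below is only a sketch: the values of sigma2, Pmax, and M are assumed for illustration, and the Laplacian outcomes are generated by the inverse-CDF method, which is a choice made here rather than the method used to produce Figure 10.32.

```matlab
% Sketch: clipping level from (10.41) and Monte Carlo check of (10.40).
% Assumptions: sigma2 = 1, M = 100000 trials, and inverse-CDF
% generation of the Laplacian outcomes (illustrative choices).
sigma2=1;                           % speech power (assumed value)
Pmax=0.01;                          % allowed probability of clipping
A=sqrt(sigma2/2)*log(1/Pmax);       % clipping level from (10.41)
M=100000;
u=rand(M,1)-0.5;                    % uniform on (-1/2,1/2)
x=-sqrt(sigma2/2)*sign(u).*log(1-2*abs(u));  % Laplacian outcomes
Pclipest=sum(abs(x)>A)/M;           % estimated probability of clipping
Pcliptrue=exp(-sqrt(2/sigma2)*A);   % theoretical value from (10.40)
```

With these assumed values A evaluates to about 3.256, in line with the rounded level used in the example, and the estimated clipping probability should fall close to 0.01.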


References 

Abramowitz, M., I.A. Stegun, Handbook of Mathematical Functions, Dover, New
York, 1965.

Capinski, M., P.E. Kopp, Measure, Integral, and Probability, Springer-Verlag, New
York, 2004.

Johnson, N.L., S. Kotz, N. Balakrishnan, Continuous Univariate Distributions,
Vols. 1, 2, John Wiley & Sons, New York, 1994.

Members of Technical Staff, Transmission Systems for Communications, Western
Electric Co., Inc., Winston-Salem, NC, 1970.

Rabiner, L.R., R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall,
Englewood Cliffs, NJ, 1978.

Widder, D.A., Advanced Calculus, Dover, New York, 1989.


Problems 


10.1 (w) Are the following random variables continuous or discrete?

a.  Temperature  in  degrees  Fahrenheit 

b.  Temperature  rounded  off  to  nearest  1° 

c.  Temperature  rounded  off  to  nearest  1/2° 

d.  Temperature  rounded  off  to  nearest  1/4° 

10.2 (0) (w) The temperature in degrees Fahrenheit is modeled as a uniform random
variable with $T \sim \mathcal{U}(20, 60)$. If $T$ is rounded off to the nearest 1/2° to
form $\hat{T}$, what is $P[\hat{T} = 30°]$? What can you say about the use of a PDF versus
a PMF to describe the probabilistic outcome of a physical experiment?

10.3 (w) A wedge of cheese as shown in Figure 10.5 is sliced from $x = a$ to $x = b$.
If $a = 0$ and $b = 0.2$, what is the mass of cheese in the wedge? How about if
$a = 1.8$ and $b = 2$?

10.4  (o)  (w)  Which  of  the  functions  shown  in  Figure  10.33  are  valid  PDFs?  If  a 
function  is  not  a  PDF,  why  not? 

10.5 (f) Determine the value of $c$ to make the following function a valid PDF

$$g(x) = \begin{cases} c(1 - |x/5|) & |x| < 5 \\ 0 & \text{otherwise.} \end{cases}$$
Figure 10.33: Possible PDFs for Problem 10.4 (panels (a), (b), and (c)).


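10.6 (o) (w) A Gaussian mixture PDF is defined as

$$p_X(x) = \alpha_1 \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{1}{2\sigma_1^2}x^2\right) + \alpha_2 \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2\sigma_2^2}x^2\right)$$

for $\sigma_1^2 \ne \sigma_2^2$. What are the possible values for $\alpha_1$ and $\alpha_2$ so that this is a valid
PDF?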


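10.7 (w) Find the area under the curves given by the following functions:

$$g_1(x) = \begin{cases} x & 0 < x < 1 \\ 1 + x & 1 < x < 2 \\ 0 & \text{otherwise} \end{cases}$$

$$g_2(x) = \begin{cases} x & 0 < x < 1 \\ 1 + x & 1 < x < 2 \\ 0 & \text{otherwise} \end{cases}$$

and explain your results.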


10.8 (w) A memory chip has a projected lifetime $X$ in days that is modeled as
$X \sim \exp(0.001)$. What is the probability that it will fail within one year?


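10.9 (t) In this problem we prove that the Gaussian PDF integrates to one. First
we let

$$I = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right) dx$$

and write $I^2$ as the iterated integral

$$I^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{1}{2\pi} \exp\left(-\frac{1}{2}(x^2 + y^2)\right) dy\,dx.$$

Next, convert $(x, y)$ into polar coordinates and evaluate the expression to prove
that $I^2 = 1$. Finally, you can conclude that $I = 1$ (why?).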
10.10 (f,c) If $X \sim \mathcal{N}(\mu, \sigma^2)$, find $P[X > \mu + a\sigma]$ for $a = 1, 2, 3$, where $\sigma = \sqrt{\sigma^2}$.


10.11 (t) The median of a PDF is defined as the point $x = \text{med}$ for which
$P[X \le \text{med}] = 1/2$. Prove that if $X \sim \mathcal{N}(\mu, \sigma^2)$, then $\text{med} = \mu$.


10.12 (o) (w) A constant or DC current source that outputs 1 amp is connected
to a resistor of nominal resistance of 1 ohm. If the resistance value can vary
according to $R \sim \mathcal{N}(1, 0.1)$, what is the probability that the voltage across
the resistor will be between 0.99 and 1.01 volts?


10.13 (w) An analog-to-digital convertor can convert voltages in the range $[-3, 3]$
volts to a digital number. Outside this range, it will “clip” a positive voltage
at the highest positive level, i.e., $+3$, or a negative voltage at the most negative
level, i.e., $-3$. If the input to the convertor is modeled as $X \sim \mathcal{N}(\mu, 1)$, how
should $\mu$ be chosen to minimize the probability of clipping?

10.14 (o) (f) Find $P[X > 3]$ for the two PDFs given by the Gaussian PDF with
$\mu = 0$, $\sigma^2 = 1$ and the Laplacian PDF with $\sigma^2 = 1$. Which probability is larger
and why? Plot both PDFs.

10.15  (f)  Verify  that  the  Cauchy  PDF  given  in  (10.9)  integrates  to  one. 

10.16 (t) Prove that $\Gamma(z + 1) = z\Gamma(z)$ by using integration by parts (see Appendix
B and Problem 11.7).

10.17 (0) (f) The arrival time in minutes of the $N$th person at a ticket counter
has a PDF that is Erlang with $\lambda = 0.1$. What is the probability that the
first person will arrive within the first 5 minutes of the opening of the ticket
counter? What is the probability that the first two persons will arrive within
the first 5 minutes of opening?

10.18 (f) A person cuts off a wedge of cheese as shown in Figure 10.5 starting at
$x = 0$ and ending at some value $x = x_0$. Determine the mass of the wedge as
a function of the value $x_0$. Can you relate this to the CDF?


10.19  (o)  (f)  Determine  the  CDF  for  the  Cauchy  PDF. 


10.20 (f) If $X \sim \mathcal{N}(0, 1)$ find the probability that $|X| < a$, where $a = 1, 2, 3$. Also,
plot the PDF and shade in the corresponding areas under the PDF.


10.21 (f,c) If $X \sim \mathcal{N}(0, 1)$, determine the number of outcomes out of 1000 that
you would expect to occur within the interval $[1, 2]$. Next conduct a computer
simulation to carry out this experiment. How many outcomes actually occur
within this interval?


10.22 (o) (w) If $X \sim \mathcal{N}(\mu, \sigma^2)$, find the CDF of $X$ in terms of $\Phi(x)$.
10.23 (t) If a PDF is symmetric about $x = 0$ (also called an even function), prove
that $F_X(-x) = 1 - F_X(x)$. Does this property hold for a Gaussian PDF with
$\mu = 0$? Hint: See Figure 10.16.

10.24 (t) Prove that if $X \sim \mathcal{N}(\mu, \sigma^2)$, then

$$P[X > a] = Q\left(\frac{a - \mu}{\sigma}\right)$$

where $\sigma = \sqrt{\sigma^2}$.


10.25  (t)  Prove  the  properties  of  the  Q  function  given  by  (10.19)-(10.22). 

10.26 (f) Plot the function $Q(A/2)$ versus $A$ for $0 < A < 5$ to verify the true
probability of error as shown in Figure 2.15.

10.27 (c) If $X \sim \mathcal{N}(0, 1)$, evaluate $P[X > 4]$ and then verify your results using
a computer simulation. How easy do you think it would be to determine
$P[X > 7]$ using a computer simulation? (See Section 11.10 for an alternative
approach.)

10.28 (^) (w) A survey is taken of the incomes of a large number of people in
a city. It is determined that the income in dollars is distributed as
$X \sim \mathcal{N}(50000, 10^8)$. What percentage of the people have incomes above $70,000?

10.29 (w) In Chapter 1 an example was given of the length of time in minutes
an office worker spends on the telephone in a given 10-minute period. The
length of time $T$ was given as $\mathcal{N}(7, 1)$ as shown in Figure 1.5. Determine the
probability that a caller is on the telephone more than 8 minutes by finding
$P[T > 8]$.

10.30 (^) (w) A population of high school students in the eastern United States
score $X$ points on their SATs, where $X \sim \mathcal{N}(500, 4900)$. A similar population
in the western United States score $X$ points, where $X \sim \mathcal{N}(525, 3600)$. Which
group is more likely to have scores above 700?

10.31  (f)  Verify  the  numerical  results  given  in  (1.3). 

10.32 (f) In Example 2.2 we asserted that $P[X > 2]$ for a standard normal random
variable is 0.0228. Verify this result.

10.33 (^) (w) Is the following function a valid CDF?

$$F_X(x) = \frac{1}{1 + \exp(-x)} \qquad -\infty < x < \infty.$$


10.34 (f) If $F_X(x) = (2/\pi)\arctan(x)$ for $0 < x < \infty$, determine $P[0 < X < 1]$.

10.35 (t) Prove that (10.25) is true.

10.36 (o) (w) Professor Staff always scales his test scores. He adds a number of
points $c$ to each score so that 50% of the class get a grade of C. A C is given if
the score is between 70 and 80. If the scores have the distribution $\mathcal{N}(65, 38)$,
what should $c$ be? Hint: There are two possible solutions to this problem but
the students will prefer only one of them.

10.37 (w) A Rhode Island weatherman says that he can accurately predict the
temperature for the following day 95% of the time. He makes his prediction
by saying that the temperature will be between $T_1$° Fahrenheit and $T_2$° Fahrenheit.
If he knows that the actual temperature is a random variable with PDF
$\mathcal{N}(50, 10)$, what should his prediction be for the next day?

10.38  (f)  For  the  CDF  given  in  Figure  10.14  find  the  PDF  by  differentiating.  What 
happens  at  x  =  1  and  x  =  2? 

10.39 (f,c) If $Y = \exp(X)$, where $X \sim \mathcal{U}(0, 1)$, find the PDF of $Y$. Next generate
realizations of $X$ on a computer and transform them according to $\exp(X)$ to
yield the realizations of $Y$. Plot the $x$'s and $y$'s in a similar manner to that
shown in Figure 10.22 and discuss your results.

10.40 (0) (f) Find the PDF of $Y = X^4 + 1$ if $X \sim \exp(\lambda)$.

10.41 (w) Find the constants $a$ and $b$ so that $Y = aX + b$, where $X \sim \mathcal{U}(0, 1)$,
yields $Y \sim \mathcal{U}(2, 6)$.

10.42 (f) If $Y = aX$, find the PDF of $Y$ if the PDF of $X$ is $p_X(x)$. Next, assume
that $X \sim \exp(1)$ and find the PDFs of $Y$ for $a > 1$ and $0 < a < 1$. Plot these
PDFs and explain your results.

10.43  (^)  (f)  Find  a  general  formula  for  the  PDF  of  Y  =  |X|.  Next,  evaluate  your 
formula  if  X  is  a  standard  normal  random  variable. 

10.44 (f) If $X \sim \mathcal{N}(0, 1)$ is transformed according to $Y = \exp(X)$, determine $p_Y(y)$
by using the CDF approach. Compare your results to those given in Example
10.6. Hint: You will need Leibnitz's rule

$$\frac{d}{dy} \int_a^{g(y)} p(x)\,dx = p(g(y)) \frac{dg(y)}{dy}.$$

10.45 (w) A random voltage $X$ is input to a full wave rectifier that produces at its
output the absolute value of the voltage. If $X$ is a standard normal random
variable, what is the probability that the output of the rectifier will exceed 2?

10.46 (^) (f,c) If $Y = X^2$, where $X \sim \mathcal{U}(0, 1)$, determine the PDF of $Y$. Next
perform a computer simulation using the realizations of $Y$ (obtained as $y_m = x_m^2$,
where $x_m$ is the $m$th realization of $X$) to estimate the PDF $p_Y(y)$. Do
your theoretical results match the simulated results?


10.47  (w)  If  a  discrete  random  variable  X  has  a  Ber(p)  PMF,  find  the  PDF  of  X 
using  impulses.  Next  find  the  CDF  of  X  by  integrating  the  PDF. 


10.48 (t) In this problem we point out that the use of impulses or Dirac delta
functions serves mainly as a tool to allow sums to be written as integrals. For
example, the sum

$$S = \sum_{i=1}^{N} a_i$$

can be written as the integral

$$S = \int_{-\infty}^{\infty} g(x)\,dx$$

if we define $g(x)$ as

$$g(x) = \sum_{i=1}^{N} a_i \delta(x - i).$$

Verify that this is true and show how it applies to computing probabilities of
events of discrete random variables by using integration.


10.49  (f)  Evaluate  the  expression 


Could the integrand represent a PDF? If it does, what does this integral represent?


10.50 (w) Plot the PDF and CDF if

$$p_X(x) = \frac{1}{2}\exp(-x)u(x) + \frac{1}{4}\delta(x + 1) + \frac{1}{4}\delta(x - 1).$$

10.51 (^) (w) For the PDF given in Problem 10.50 determine the following:
$P[-2 \le X \le 2]$, $P[-1 \le X \le 1]$, $P[-1 < X \le 1]$, $P[-1 < X < 1]$,
$P[-1 \le X < 1]$.

10.52 (f) Find and plot the PDF of the transformed random variable

$$Y = \begin{cases} 2X & 0 \le X < 1 \\ 2 & X \ge 1 \end{cases}$$

where $X \sim \exp(1)$.

10.53 (f) Find the PDF representation of the PMF of a bin(3, 1/2) random variable.
Plot the PMF and the PDF.

10.54 (^) (f) Determine the function $g$ so that $X = g(U)$, where $U \sim \mathcal{U}(0, 1)$, has
a Rayleigh PDF with $\sigma^2 = 1$.

10.55 (f) Find a transformation so that $X = g(U)$, where $U \sim \mathcal{U}(0, 1)$, has the
PDF shown in Figure 10.34.


Figure  10.34:  PDF  for  Problem  10.55 


10.56  (c)  Verify  your  results  in  Problem  10.55  by  generating  realizations  of  the 
random  variable  whose  PDF  is  shown  in  Figure  10.34.  Next  estimate  the 
PDF  and  compare  it  to  the  true  PDF. 

10.57 (t) A monotonically increasing function $g(x)$ is defined as one for which if
$x_2 > x_1$, then $g(x_2) > g(x_1)$. A monotonically decreasing function is one
for which if $x_2 > x_1$, then $g(x_2) < g(x_1)$. It can be shown that if $g(x)$
is differentiable, then a function is monotonically increasing (decreasing) if
$dg(x)/dx > 0$ ($dg(x)/dx < 0$) for all $x$. Which of the following functions are
monotonically increasing or decreasing: $\exp(x)$, $\ln(x)$, and $1/x$?

10.58 (t) Explain why the values of $x$ for which the inequality $x > x_0$ is true do not
change if we take the logarithm of both sides to yield $\ln(x) > \ln(x_0)$. Would
the inequality still hold if we inverted both sides or equivalently applied the
function $g(x) = 1/x$ to both sides? Hint: See Problem 10.57.

10.59  (w)  Compare  the  true  PDF  given  in  Figure  10.24  with  the  estimated  PDF 
shown  in  Figure  2.10.  Are  they  the  same  and  if  not,  why  not? 

10.60 (c) Generate on the computer realizations of the random variable
$X \sim \mathcal{N}(1, 4)$. Estimate the PDF and compare it to the true one.

10.61 (c) Determine the PDF of $Y = X^3$ if $X \sim \mathcal{U}(0, 1)$. Next generate realizations
of $X$ on the computer, apply the transformation $g(x) = x^3$ to each realization
to yield realizations of $Y$, and finally estimate the PDF of $Y$ from these
realizations. Does it agree with the true PDF?

10.62  (c)  For  the  random  variable  Y  described  in  Problem  10.61  determine  the 
CDF.  Then,  generate  realizations  of  Y,  estimate  the  CDF,  and  compare  it  to 
the  true  one. 


Appendix  10A 


Derivation  of  PDF  of  a 
Transformed  Continuous 
Random  Variable 


The proof uses the CDF approach as described in Section 10.7. It assumes that $g$
is a one-to-one function. If $Y = g(X)$, where $g$ is a one-to-one and monotonically
increasing function, then there is a single solution for $x$ in $y = g(x)$. Thus,

$$\begin{aligned} F_Y(y) &= P[g(X) \le y] \\ &= P[X \le g^{-1}(y)] \\ &= F_X(g^{-1}(y)). \end{aligned}$$

But $p_Y(y) = dF_Y(y)/dy$ so that

$$\begin{aligned} p_Y(y) &= \frac{d}{dy} F_X(g^{-1}(y)) \\ &= \left.\frac{dF_X(x)}{dx}\right|_{x = g^{-1}(y)} \frac{dg^{-1}(y)}{dy} \qquad \text{(chain rule of calculus)} \\ &= p_X(g^{-1}(y)) \frac{dg^{-1}(y)}{dy}. \end{aligned}$$


If $g(x)$ is one-to-one and monotonically decreasing, then

$$\begin{aligned} F_Y(y) &= P[g(X) \le y] \\ &= P[X \ge g^{-1}(y)] \\ &= 1 - P[X \le g^{-1}(y)] \qquad \text{(since } P[X = g^{-1}(y)] = 0\text{)} \\ &= 1 - F_X(g^{-1}(y)) \end{aligned}$$

and

$$p_Y(y) = \frac{dF_Y(y)}{dy} = -p_X(g^{-1}(y)) \frac{dg^{-1}(y)}{dy}.$$

Note that if $g$ is monotonically decreasing, then $g^{-1}$ is also monotonically decreasing.
Hence, $dg^{-1}(y)/dy$ will be negative. Thus, both cases can be subsumed by the
formula

$$p_Y(y) = p_X(g^{-1}(y)) \left| \frac{dg^{-1}(y)}{dy} \right|.$$

Appendix  10B 


MATLAB  Subprograms  to 
Compute  Q  and  Inverse  Q 
Functions 

% Q.m
%
% This program computes the right-tail probability
% (complementary cumulative distribution function) for
% a N(0,1) random variable.
%
% Input Parameters:
%
%   x - Real column vector of x values
%
% Output Parameters:
%
%   y - Real column vector of right-tail probabilities
%
% Verification Test Case:
%
% The input x=[0 1 2]'; should produce y=[0.5 0.1587 0.0228]'.
%
function y=Q(x)
y=0.5*erfc(x/sqrt(2));   % complementary error function


% Qinv.m
%
% This program computes the inverse Q function or the value
% which is exceeded by a N(0,1) random variable with a
% probability of x.
%
% Input Parameters:
%
%   x - Real column vector of right-tail probabilities
%       (in interval [0,1])
%
% Output Parameters:
%
%   y - Real column vector of values of random variable
%
% Verification Test Case:
%
% The input x=[0.5 0.1587 0.0228]'; should produce
% y=[0 0.9998 1.9991]'.
%
function y=Qinv(x)
y=sqrt(2)*erfinv(1-2*x);   % inverse error function
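
The two subprograms can be exercised together with a round-trip check. The fragment below is a sketch only; it assumes anonymous functions as stand-ins for the files Q.m and Qinv.m so that it can be run as a single script.

```matlab
% Sketch: round-trip check of the Q and inverse Q computations.
% Assumption: anonymous functions stand in for Q.m and Qinv.m.
Q=@(x) 0.5*erfc(x/sqrt(2));        % right-tail probability of N(0,1)
Qinv=@(p) sqrt(2)*erfinv(1-2*p);   % inverse of the Q function
x=[0 1 2]';
p=Q(x);                            % right-tail probabilities at x
xback=Qinv(p);                     % should recover x
```

Note that the Qinv verification case in the listing returns 0.9998 and 1.9991 rather than 1 and 2 because its inputs are the probabilities rounded to four decimal places.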


Chapter  11 


Expected  Values  for  Continuous 
Random  Variables 

11.1  Introduction 

We  now  define  the  expectation  of  a  continuous  random  variable.  In  doing  so  we 
parallel  the  discussion  of  expected  values  for  discrete  random  variables  given  in 
Chapter 6. Based on the probability density function (PDF) description of a continuous
random variable, the expected value is defined and its properties explored.
The  discussion  is  conceptually  much  the  same  as  before,  only  the  particular  method 
of  evaluating  the  expected  value  is  different.  Hence,  we  will  concentrate  on  the 
manipulations  required  to  obtain  the  expected  value. 


11.2  Summary 

The expected value $E[X]$ for a continuous random variable is motivated from the
analogous definition for a discrete random variable in Section 11.3. Its definition is
given by (11.3). An analogy with the center of mass of a wedge is also described.
For the expected value to exist we must have $E[|X|] < \infty$ or the expected value of
the absolute value of the random variable must be finite. The expected values for
the  common  continuous  random  variables  are  given  in  Section  11.4  with  a  summary 
given  in  Table  11.1.  The  expected  value  of  a  function  of  a  continuous  random 
variable  can  be  easily  found  using  (11.10),  eliminating  the  need  to  find  the  PDF  of 
the  transformed  random  variable.  The  expectation  is  shown  to  be  linear  in  Example 
11.2.  For  a  mixed  random  variable  the  expectation  is  computed  using  (11.11).  The 
variance  is  defined  by  (11.12)  with  some  examples  given  in  Section  11.6.  It  has 
the  same  properties  as  for  a  discrete  random  variable,  some  of  which  are  given  in 
(11.13),  and  is  a  nonlinear  operation.  The  moments  of  a  continuous  random  variable 
are defined as $E[X^n]$ and can be found either by using a direct integral evaluation as
in Example 11.6 or by using the characteristic function (11.18). The characteristic
function  is  the  Fourier  transform  of  the  PDF  as  given  by  (11.17).  Central  moments, 
which  are  the  moments  about  the  mean,  are  related  to  the  moments  by  (11.15). 
The  second  central  moment  is  just  the  variance.  Although  the  probability  of  an 
event  cannot  in  general  be  determined  from  the  mean  and  variance,  the  Chebyshev 
inequality  of  (11.21)  provides  a  formula  for  bounding  the  probability.  The  mean  and 
variance  can  be  estimated  using  (11.22)  and  (11.23).  Finally,  an  application  of  mean 
estimation  to  test  highly  reliable  software  is  described  in  Section  11.10.  It  is  based 
on  importance  sampling,  which  provides  a  means  of  estimating  small  probabilities 
with  a  reasonable  number  of  Monte  Carlo  trials. 


11.3  Determining  the  Expected  Value 


The expected value for a discrete random variable $X$ was defined in Chapter 6 to
be

$$E[X] = \sum_i x_i p_X[x_i] \qquad (11.1)$$

where $p_X[x_i]$ is the probability mass function (PMF) of $X$ and the sum is over all $i$
for which the PMF $p_X[x_i]$ is nonzero. In the case of a continuous random variable,
the sample space $\mathcal{S}_X$ is not countable and hence (11.1) can no longer be used. For
example, if $X \sim \mathcal{U}(0, 1)$, then $X$ can take on any value in the interval $(0, 1)$, which
consists of an uncountable number of values. We might expect that the average
value is $E[X] = 1/2$ since the probability of $X$ being in any equal length interval
in $(0, 1)$ is the same. To verify this conjecture we employ the same strategy used
previously, that of approximating a uniform PDF by a uniform PMF, using a fine
partitioning of the interval $(0, 1)$. Letting


$$p_X[x_i] = \frac{1}{M} \qquad x_i = i\Delta x$$

for $i = 1, 2, \ldots, M$ and with $\Delta x = 1/M$, we have from (11.1)

$$E[X] = \sum_{i=1}^{M} x_i p_X[x_i] = \sum_{i=1}^{M} (i\Delta x)\left(\frac{1}{M}\right) = \sum_{i=1}^{M} \frac{i}{M}\,\frac{1}{M} = \frac{1}{M^2} \sum_{i=1}^{M} i.$$

But $\sum_{i=1}^{M} i = (M/2)(M + 1)$ so that

$$E[X] = \frac{1}{M^2}\,\frac{M}{2}(M + 1) = \frac{1}{2} + \frac{1}{2M} \qquad (11.2)$$

and as $M \to \infty$ or the partition of $(0, 1)$ becomes infinitely fine, we have $E[X] \to 1/2$,
as expected. To extend these results to more general PDFs we first note from (11.2)
that

$$\begin{aligned} E[X] &= \sum_{i=1}^{M} x_i P[x_i - \Delta x/2 \le X \le x_i + \Delta x/2] \\ &= \sum_{i=1}^{M} x_i \frac{P[x_i - \Delta x/2 \le X \le x_i + \Delta x/2]}{\Delta x} \Delta x. \end{aligned}$$

But

$$\frac{P[x_i - \Delta x/2 \le X \le x_i + \Delta x/2]}{\Delta x} = \frac{1/M}{\Delta x} = 1$$

and as $\Delta x \to 0$, this is the probability per unit length for all small intervals centered
about $x_i$, which is the PDF evaluated at $x = x_i$. In this example, $p_X(x_i)$ does not
depend on the interval center, which is $x_i$, so that the PDF is uniform or $p_X(x) = 1$
for $0 < x < 1$. Thus, as $\Delta x \to 0$


$$E[X] \to \sum_{i=1}^{M} x_i p_X(x_i) \Delta x$$

and this becomes the integral

$$E[X] = \int_0^1 x p_X(x)\,dx$$

where $p_X(x) = 1$ for $0 < x < 1$ and is zero otherwise. To confirm that this integral
produces a result consistent with our earlier value of $E[X] = 1/2$, we have

$$E[X] = \int_0^1 x p_X(x)\,dx = \int_0^1 x \cdot 1\,dx = \left.\frac{1}{2}x^2\right|_0^1 = \frac{1}{2}.$$

In general, the expected value for a continuous random variable $X$ is defined as

$$E[X] = \int_{-\infty}^{\infty} x p_X(x)\,dx \qquad (11.3)$$

where $p_X(x)$ is the PDF of $X$. Another example follows.
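
Before that example, the partition argument leading to (11.2) can be checked numerically. The MATLAB fragment below is a sketch that evaluates the sum in (11.1) for the uniform PMF approximation and compares it with the closed form $1/2 + 1/(2M)$; the particular values of $M$ are arbitrary choices made here for illustration.

```matlab
% Sketch: discrete approximation (11.1) for X ~ U(0,1) on finer
% partitions, compared with 1/2 + 1/(2M) from (11.2).
% Assumption: the values of M below are illustrative only.
Mvals=[10 100 1000 10000]';
EX=zeros(length(Mvals),1);
for k=1:length(Mvals)
   M=Mvals(k);
   xi=[1:M]'/M;             % xi = i*Delta x with Delta x = 1/M
   EX(k)=sum(xi*(1/M));     % sum of xi*pX[xi] with pX[xi] = 1/M
end
EXformula=0.5+1./(2*Mvals); % closed form (11.2)
```

The entries of EX decrease toward 1/2 as $M$ grows, with the excess over 1/2 halving each time $M$ doubles, as (11.2) predicts.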

Example  11.1  —  Expected  value  for  random  variable  with  a  nonuniform 
PDF 

Consider the computation of the expected value for the PDF shown in Figure 11.1a.
From the PDF and some typical outcomes shown in Figure 11.1b the expected value
should be between 1 and 2.

Figure 11.1: Example of nonuniform PDF and its mean. (a) PDF, $p_X(x) = x/2$. (b) Typical outcomes and expected value of 1.33.

Using (11.3) we have

$$E[X] = \int_0^2 x\left(\frac{x}{2}\right) dx = \left.\frac{x^3}{6}\right|_0^2 = \frac{4}{3}$$

which appears to be reasonable.
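
The value $E[X] = 4/3$ can also be confirmed numerically. The following MATLAB fragment is a sketch under stated assumptions: the integral in (11.3) is approximated by a sum over N = 2000 intervals, and outcomes are generated by the inverse-CDF method using $F_X(x) = x^2/4$ on $(0, 2)$, so that $X = 2\sqrt{U}$ with $U \sim \mathcal{U}(0, 1)$. Neither choice comes from the text.

```matlab
% Sketch: check E[X] = 4/3 for pX(x) = x/2 on (0,2) in two ways.
% Assumptions: N = 2000 intervals for the sum and M = 100000
% outcomes generated as X = 2*sqrt(U), U ~ U(0,1).
N=2000;
dx=2/N;
x=([1:N]'-0.5)*dx;          % interval centers on (0,2)
EXsum=sum(x.*(x/2)*dx);     % approximates the integral in (11.3)
M=100000;
xout=2*sqrt(rand(M,1));     % outcomes with CDF FX(x) = x^2/4
EXest=mean(xout);           % sample mean of the outcomes
```

Both EXsum and EXest should be close to 1.33, the value marked in Figure 11.1b.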


❖

As an analogy to the expected value we can revisit our Jarlsberg cheese first described
in Section 10.3, and which is shown in Figure 11.2. The integral

$$CM = \int_0^2 x m(x)\,dx \qquad (11.4)$$

is the center of mass, assuming that the total mass or $\int_0^2 m(x)\,dx$, is one. Here,
$m(x)$ is the linear mass density or mass per unit length. The center of mass is the
point at which one could balance the cheese on the point of a pencil. Recall that
the linear mass density is $m(x) = x/2$ for which $CM = 4/3$ from Example 11.1.

Figure 11.2: Center of mass (CM) analogy to average value (center of mass at $x = 4/3$).

To show that CM is the balance point we first note that $\int_0^2 m(x)\,dx = 1$ so that we can
write (11.4) as

$$\begin{aligned} \int_0^2 x m(x)\,dx - CM &= 0 \\ \int_0^2 x m(x)\,dx - CM \int_0^2 m(x)\,dx &= 0 \\ \underbrace{\int_0^2}_{\text{sum}} \underbrace{(x - CM)}_{\text{moment arm}} \underbrace{m(x)\,dx}_{\text{mass}} &= 0. \end{aligned}$$

Since the “sum” of the mass times moment arms is zero, the cheese is balanced at
$x = CM = 4/3$.

By the same argument the expected value can also be found by solving

$$\int_{-\infty}^{\infty} (x - E[X]) p_X(x)\,dx = 0 \qquad (11.5)$$

for $E[X]$. If, however, the PDF is symmetric about some point $x = a$, which is to
say that $p_X(a + u) = p_X(a - u)$ for $-\infty < u < \infty$, then (see Problem 11.2)

$$\int_{-\infty}^{\infty} (x - a) p_X(x)\,dx = 0 \qquad (11.6)$$

and therefore $E[X] = a$. Such was the case for $X \sim \mathcal{U}(0, 1)$, whose PDF is symmetric
about $a = 1/2$. Another example is the Gaussian PDF which is symmetric about
$a = \mu$ as seen in Figures 10.8a and 10.8c. Hence, $E[X] = \mu$ for a Gaussian random
variable (see also the next section for a direct derivation). In summary, if the PDF
is symmetric about a point, then that point is $E[X]$, provided that the expected
value exists, as discussed next. However, the PDF need not
be symmetric about any point as in Example 11.1.
Not all PDFs have expected values.

Before computing the expected value of a random variable using (11.3) we must
make sure that it exists (see similar discussion in Section 6.4 for discrete random
variables). Not all integrals of the form $\int_{-\infty}^{\infty} x p_X(x)\,dx$ exist, even if
$\int_{-\infty}^{\infty} p_X(x)\,dx = 1$. For example, if

$$p_X(x) = \begin{cases} \dfrac{1}{2x^{3/2}} & x \ge 1 \\ 0 & x < 1 \end{cases}$$

then

$$\int_1^{\infty} \frac{1}{2} x^{-3/2}\,dx = \left. -\frac{1}{\sqrt{x}} \right|_1^{\infty} = 1$$

but

$$\int_1^{\infty} x\,\frac{1}{2} x^{-3/2}\,dx = \left. \sqrt{x}\, \right|_1^{\infty} = \infty.$$


A more subtle and somewhat surprising example is the Cauchy PDF. Recall that it
is given by

$$p_X(x) = \frac{1}{\pi(1 + x^2)} \qquad -\infty < x < \infty.$$

Since the PDF is symmetric about $x = 0$, we would expect that $E[X] = 0$. However,
if we are careful about our definition of expected value by correctly interpreting the
region of integration in a limiting sense, we would have

$$E[X] = \lim_{L \to -\infty} \int_L^0 x p_X(x)\,dx + \lim_{U \to \infty} \int_0^U x p_X(x)\,dx.$$

But for a Cauchy PDF

$$\begin{aligned} E[X] &= \lim_{L \to -\infty} \int_L^0 \frac{x}{\pi(1 + x^2)}\,dx + \lim_{U \to \infty} \int_0^U \frac{x}{\pi(1 + x^2)}\,dx \\ &= \lim_{L \to -\infty} \left. \frac{1}{2\pi} \ln(1 + x^2) \right|_L^0 + \lim_{U \to \infty} \left. \frac{1}{2\pi} \ln(1 + x^2) \right|_0^U \\ &= \lim_{L \to -\infty} -\frac{1}{2\pi} \ln(1 + L^2) + \lim_{U \to \infty} \frac{1}{2\pi} \ln(1 + U^2) \\ &= -\infty + \infty = \ ? \end{aligned}$$

Hence, if the limits are taken independently, then the result is indeterminate. To
make the expected value useful in practice the independent choice of limits (and not
$L = -U$) is necessary. The indeterminacy can be avoided, however, if we require
“absolute convergence” or

$$\int_{-\infty}^{\infty} |x| p_X(x)\,dx < \infty.$$
Hence, $E[X]$ is defined to exist if $E[|X|] < \infty$. This surprising result can be “verified”
by a computer simulation, the results of which are shown in Figure 11.3.

Figure 11.3: Illustration of nonexistence of Cauchy PDF mean. (a) First 50 outcomes. (b) Sample mean.

In Figure 11.3a the first 50 outcomes of a total of 10,000 are shown. Because of the
slow decay of the “tails” of the PDF or since the PDF decays only as $1/x^2$, very
large outcomes are possible. As seen in Figure 11.3b the sample mean does not
converge to zero as might be expected because of these infrequent but very large
outcomes (see also Problem 12.41). See also Problem 11.3 on the simulation used in
this example and Problems 11.4 and 11.9 on how to make the sample mean converge
by truncating the PDF.
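
A simulation of this kind can be sketched in a few lines of MATLAB. The fragment below is not the code used to produce Figure 11.3; it assumes that Cauchy outcomes are generated by the inverse-CDF method, $X = \tan(\pi(U - 1/2))$ with $U \sim \mathcal{U}(0, 1)$, and it tracks the sample mean as more outcomes are included.

```matlab
% Sketch: running sample mean of Cauchy outcomes.
% Assumptions: inverse-CDF generation, X = tan(pi*(U-1/2)) with
% U ~ U(0,1), and M = 10000 outcomes as in Figure 11.3.
M=10000;
u=rand(M,1);
x=tan(pi*(u-0.5));              % Cauchy outcomes
runmean=cumsum(x)./[1:M]';      % sample mean after 1,2,...,M outcomes
% plot([1:M]',runmean)          % uncomment to view the behavior
```

Repeating the run several times typically shows the running mean jumping whenever a very large outcome occurs rather than settling near zero.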


Finally, as for discrete random variables the expected value is the best guess of the
outcome of the random variable. By “best” we mean that the use of $b = E[X]$ as
our estimator minimizes the mean square error, which is defined as
$\text{mse} = E[(X - b)^2]$ (see Problem 11.5).
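
This property can be illustrated numerically. In the MATLAB sketch below the mean square error is estimated from outcomes for a grid of candidate guesses $b$. The choice of $X \sim \mathcal{U}(0, 1)$, the number of trials, and the grid spacing are all assumptions made for illustration.

```matlab
% Sketch: the estimated mse = E[(X-b)^2] is smallest near b = E[X].
% Assumptions: X ~ U(0,1) outcomes (so E[X] = 1/2), M = 10000
% trials, and a grid of candidate guesses b with spacing 0.01.
M=10000;
x=rand(M,1);
b=[0:0.01:1]';                 % candidate guesses
mse=zeros(length(b),1);
for k=1:length(b)
   mse(k)=mean((x-b(k)).^2);   % estimated mse for guess b(k)
end
[msemin,kmin]=min(mse);
bbest=b(kmin);                 % grid point with smallest estimated mse
```

The minimizing grid point bbest should be the one closest to the sample mean of the outcomes, which in turn is close to $E[X] = 1/2$.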

11.4  Expected  Values  for  Important  PDFs 

We  now  determine  the  expected  values  for  the  important  PDFs  described  in  Chapter 
10.  Of  course,  the  Cauchy  PDF  is  omitted. 

11.4.1  Uniform 

If $X \sim \mathcal{U}(a,b)$, then it is easy to prove that $E[X] = (a+b)/2$ or the mean
lies at the midpoint of the interval (see Problem 11.8).

11.4.2 Exponential

If $X \sim \exp(\lambda)$, then

$$
\begin{aligned}
E[X] &= \int_{0}^{\infty} x\lambda\exp(-\lambda x)\,dx \\
&= \left[-x\exp(-\lambda x) - \frac{1}{\lambda}\exp(-\lambda x)\right]_{0}^{\infty}
 = \frac{1}{\lambda}.
\end{aligned}
\tag{11.7}
$$

Recall that the exponential PDF spreads out as $\lambda$ decreases (see Figure 10.6) and
hence the mean $1/\lambda$ increases.
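As a quick check of (11.7), the integral can be evaluated numerically and compared with a sample mean. This is a minimal sketch under our own assumptions: the value $\lambda = 2$ is arbitrary, and the realizations are generated with the inverse CDF relation $X = -\ln(1-U)/\lambda$ for $U \sim \mathcal{U}(0,1)$.

```matlab
lambda = 2;                                 % illustrative value of lambda (an assumption)
f = @(x) x.*lambda.*exp(-lambda*x);         % integrand x*p_X(x) of (11.7)
EX_num = quadgk(f,0,Inf);                   % numerical integration over x >= 0
rand('state',0);                            % reset generator for repeatability
x = -log(1-rand(100000,1))/lambda;          % exp(lambda) realizations by the inverse CDF
EX_sim = mean(x);                           % sample mean estimate of E[X]
disp([EX_num EX_sim 1/lambda])              % all three should be close to each other
```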


11.4.3  Gaussian  or  Normal 

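If $X \sim \mathcal{N}(\mu,\sigma^2)$, then since the PDF is symmetric about the point
$x = \mu$, we know that $E[X] = \mu$. A direct appeal to the definition of the expected
value yields

$$
\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} x\,\frac{1}{\sqrt{2\pi\sigma^2}}
 \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]dx \\
&= \int_{-\infty}^{\infty} (x-\mu)\,\frac{1}{\sqrt{2\pi\sigma^2}}
 \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]dx
 + \int_{-\infty}^{\infty} \mu\,\frac{1}{\sqrt{2\pi\sigma^2}}
 \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]dx.
\end{aligned}
$$

Letting $u = x - \mu$ in the first integral we have

$$
E[X] = \underbrace{\int_{-\infty}^{\infty} u\,\frac{1}{\sqrt{2\pi\sigma^2}}
 \exp\left[-\frac{1}{2\sigma^2}u^2\right]du}_{=0}
 + \mu\underbrace{\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}
 \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]dx}_{=1} = \mu.
$$

The first integral is zero since the integrand is an odd function ($g(-u) = -g(u)$, see
also Problem 11.6) and the second integral is one since it is the total area under the
Gaussian PDF.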


11.4.4  Laplacian 

The Laplacian PDF is given by

$$
p_X(x) = \frac{1}{\sqrt{2\sigma^2}}\exp\left(-\sqrt{\frac{2}{\sigma^2}}\,|x|\right)
\qquad -\infty < x < \infty
\tag{11.8}
$$

and since it is symmetric about $x = 0$ (and the expected value exists, which is needed
to avoid the situation of the Cauchy PDF), we must have $E[X] = 0$.

11.4.5 Gamma

If $X \sim \Gamma(\alpha,\lambda)$, then from (10.10)

$$E[X] = \int_{0}^{\infty} x\,\frac{\lambda^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\lambda x)\,dx.$$

To evaluate this integral we attempt to modify the integrand so that it becomes the
PDF of a $\Gamma(\alpha',\lambda')$ random variable. Then, we can immediately equate the
integral to one. Using this strategy

$$
\begin{aligned}
E[X] &= \frac{\lambda^{\alpha}}{\Gamma(\alpha)}
 \int_{0}^{\infty}\frac{\lambda^{\alpha+1}}{\Gamma(\alpha+1)}x^{\alpha}\exp(-\lambda x)\,dx\;
 \frac{\Gamma(\alpha+1)}{\lambda^{\alpha+1}} \\
&= \frac{\Gamma(\alpha+1)}{\lambda\Gamma(\alpha)}
 && \text{(integrand is } \Gamma(\alpha+1,\lambda) \text{ PDF)} \\
&= \frac{\alpha\Gamma(\alpha)}{\lambda\Gamma(\alpha)}
 && \text{(using Property 10.3)} \\
&= \frac{\alpha}{\lambda}.
\end{aligned}
$$

11.4.6 Rayleigh

It can be shown that $E[X] = \sqrt{(\pi\sigma^2)/2}$ (see Problem 11.16).

The reader should indicate on Figures 10.6-10.10, 10.12, and 10.13 where the
mean occurs.
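The closed-form means for the Gamma and Rayleigh PDFs can be checked by numerical integration of $x\,p_X(x)$. The sketch below is an illustration only; the parameter values $\alpha = 3$, $\lambda = 2$, $\sigma^2 = 4$ and the variable names are arbitrary assumptions, and the Rayleigh PDF is taken as $(x/\sigma^2)\exp[-x^2/(2\sigma^2)]$ for $x \geq 0$.

```matlab
alpha = 3; lambda = 2; sigma2 = 4;            % illustrative parameter values (assumptions)
pgam = @(x) lambda^alpha/gamma(alpha)*x.^(alpha-1).*exp(-lambda*x);  % Gamma PDF, x >= 0
pray = @(x) (x/sigma2).*exp(-x.^2/(2*sigma2));                       % Rayleigh PDF, x >= 0
EXgam = quadgk(@(x) x.*pgam(x),0,Inf);        % numerical mean of the Gamma PDF
EXray = quadgk(@(x) x.*pray(x),0,Inf);        % numerical mean of the Rayleigh PDF
disp([EXgam alpha/lambda])                    % compare with alpha/lambda
disp([EXray sqrt(pi*sigma2/2)])               % compare with sqrt(pi*sigma^2/2)
```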


11.5 Expected Value for a Function of a Random Variable

If $Y = g(X)$, where $X$ is a continuous random variable, then assuming that $Y$ is
also a continuous random variable with PDF $p_Y(y)$, we have by the definition of
expected value of a continuous random variable

$$E[Y] = \int_{-\infty}^{\infty} y\,p_Y(y)\,dy. \tag{11.9}$$

Even if $Y$ is a mixed random variable, its expected value is still given by (11.9),
although in this case $p_Y(y)$ will contain impulses. Such would be the case if for
example, $Y = \max(0,X)$ for $X$ taking on values $-\infty < x < \infty$ (see Section 10.8).
As in the case of a discrete random variable, it is not necessary to use (11.9) directly,
which requires us to first determine $p_Y(y)$ from $p_X(x)$. Instead, we can use for
$Y = g(X)$ the formula

$$E[g(X)] = \int_{-\infty}^{\infty} g(x)\,p_X(x)\,dx. \tag{11.10}$$

A partial proof of this formula is given in Appendix 11A. Some examples of its use
follow.

Example  11.2  -  Expectation  of  linear  (affine)  function 

If $Y = aX + b$, then since $g(x) = ax + b$, we have from (11.10) that

$$
\begin{aligned}
E[g(X)] &= \int_{-\infty}^{\infty}(ax+b)\,p_X(x)\,dx \\
&= a\int_{-\infty}^{\infty}x\,p_X(x)\,dx + b\int_{-\infty}^{\infty}p_X(x)\,dx \\
&= aE[X] + b
\end{aligned}
$$

or equivalently

$$E[aX + b] = aE[X] + b.$$

This result indicates how to easily change the expectation or mean of a random
variable. For example, to increase the mean value by $b$ just replace $X$ by $X + b$.
More generally, it is easily shown that

$$E[a_1g_1(X) + a_2g_2(X)] = a_1E[g_1(X)] + a_2E[g_2(X)].$$

This says that the expectation operator is linear.


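Example 11.3 - Power of $\mathcal{N}(0,1)$ random variable

If $X \sim \mathcal{N}(0,1)$ and $Y = X^2$, consider $E[Y] = E[X^2]$. The quantity $E[X^2]$
is the average squared value of $X$ and can be interpreted physically as a power. If
$X$ is a voltage across a 1 ohm resistor, then $X^2$ is the power and therefore $E[X^2]$
is the average power. Now according to (11.10)

$$
\begin{aligned}
E[X^2] &= \int_{-\infty}^{\infty}x^2\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right)dx \\
&= 2\int_{0}^{\infty}x^2\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right)dx
 \qquad \text{(integrand is symmetric about } x = 0\text{)}.
\end{aligned}
$$

To evaluate this integral we use integration by parts ($\int U\,dV = UV - \int V\,dU$, see
also Problem 11.7) with $U = x$, $dU = dx$, $dV = (1/\sqrt{2\pi})\,x\exp[-(1/2)x^2]\,dx$ and
therefore $V = -(1/\sqrt{2\pi})\exp[-(1/2)x^2]$ to yield

$$
E[X^2] = 2\left[\left.-x\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right)\right|_{0}^{\infty}
 + \int_{0}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right)dx\right]
 = 0 + 1 = 1.
$$

The first term is zero since

$$
\lim_{x\to\infty}x\exp\left(-\frac{1}{2}x^2\right)
 = \lim_{x\to\infty}\frac{x}{\exp\left(\frac{1}{2}x^2\right)}
 = \lim_{x\to\infty}\frac{1}{x\exp\left(\frac{1}{2}x^2\right)} = 0
$$

using L’Hospital’s rule and the second term is evaluated using

$$\int_{0}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right)dx = \frac{1}{2} \qquad \text{(Why?)}.$$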
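The value $E[X^2] = 1$ can also be confirmed numerically, both by integrating $g(x)p_X(x)$ as in (11.10) and by averaging squared realizations. The following is a minimal sketch with illustrative variable names; the number of realizations is an arbitrary choice.

```matlab
g = @(x) x.^2.*exp(-0.5*x.^2)/sqrt(2*pi);   % integrand g(x)*p_X(x) with g(x) = x^2
EX2_num = quadgk(g,-Inf,Inf);               % numerical value of E[X^2] from (11.10)
randn('state',0);                           % reset Gaussian generator for repeatability
x = randn(100000,1);                        % N(0,1) realizations
EX2_sim = mean(x.^2);                       % sample mean of x^2, an estimate of the average power
disp([EX2_num EX2_sim])                     % both should be close to 1
```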


Example  11.4  —  Expected  value  of  indicator  random  variable 

An indicator function indicates whether a point is in a given set. For example, if
the set is $A = [3,4]$, then the indicator function is defined as

$$
I_A(x) = \begin{cases} 1 & 3 \leq x \leq 4 \\ 0 & \text{otherwise} \end{cases}
$$

and is shown in Figure 11.4. The subscript on $I$ refers to the set of interest. The
indicator function may be thought of as a generalization of the unit step function
since if $u(x) = 1$ for $x \geq 0$ and zero otherwise, we have that

$$I_{[0,\infty)}(x) = u(x).$$

[Figure 11.4: Example of indicator function for set $A = [3,4]$.]

Now if $X$ is a random variable, then $I_A(X)$ is a transformed random variable that
takes on values 1 and 0, depending upon whether the outcome of the experiment lies
within the set $A$ or not, respectively. (It is actually a Bernoulli random variable.)
On the average, however, it has a value between 0 and 1, which from (11.10) is

$$
\begin{aligned}
E[I_A(X)] &= \int_{-\infty}^{\infty}I_A(x)\,p_X(x)\,dx \\
&= \int_{\{x:\,x\in A\}}1\cdot p_X(x)\,dx \qquad \text{(definition)} \\
&= \int_{\{x:\,x\in A\}}p_X(x)\,dx \\
&= P[A].
\end{aligned}
$$

Therefore, the expected value of the indicator random variable is the probability of the
set or event. As an example of its utility, consider the estimation of $P[3 \leq X \leq 4]$.
But this is just $E[I_A(X)]$ when $I_A(x)$ is given in Figure 11.4. To estimate the
expected value of a transformed random variable we first generate the outcomes of $X$,
say $x_1, x_2, \ldots, x_M$, then transform each one to the new random variable producing
for $i = 1,2,\ldots,M$

$$
I_A(x_i) = \begin{cases} 1 & 3 \leq x_i \leq 4 \\ 0 & \text{otherwise} \end{cases}
$$

and finally compute the sample mean for our estimate using

$$\widehat{E[I_A(X)]} = \frac{1}{M}\sum_{i=1}^{M}I_A(x_i).$$

However, since $P[A] = E[I_A(X)]$, we have as our estimate of the probability

$$\widehat{P[A]} = \frac{1}{M}\sum_{i=1}^{M}I_A(x_i).$$

But this is just what we have been using all along, since $\sum_{i=1}^{M}I_A(x_i)$ counts
all the outcomes for which $3 \leq x \leq 4$. Thus, the indicator function provides a means
to connect the expected value with the probability. This is very useful for later
theoretical work in probability.
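The estimation procedure just described can be sketched in MATLAB as follows. The text does not specify the PDF of $X$, so for illustration only we assume $X \sim \mathcal{N}(3,1)$, for which $P[3 \leq X \leq 4] = Q(0) - Q(1)$; the variable names and the number of realizations are also our own choices.

```matlab
randn('state',0);                 % reset generator for repeatability
M = 100000;                       % number of realizations
x = 3 + randn(M,1);               % outcomes of X; X ~ N(3,1) is an assumption made here
IA = (x>=3) & (x<=4);             % indicator I_A(x_i) for A = [3,4]: 1 if in A, 0 otherwise
Pest = sum(IA)/M;                 % sample mean of the indicator, the estimate of P[A]
Q = @(z) 0.5*erfc(z/sqrt(2));     % right-tail probability of a N(0,1) random variable
Ptrue = Q(0)-Q(1);                % P[3 <= X <= 4] for the assumed PDF
disp([Pest Ptrue])
```

Note that `sum(IA)` is simply the count of outcomes falling in $A$, so the sample mean of the indicator is the familiar relative frequency estimate.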


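Lastly, if the random variable is a mixed one with PDF

$$p_X(x) = p_c(x) + \sum_{i=1}^{\infty}p_i\,\delta(x - x_i)$$

where $p_c(x)$ is the continuous part of the PDF, then the expected value becomes

$$
\begin{aligned}
E[X] &= \int_{-\infty}^{\infty}x\left(p_c(x) + \sum_{i=1}^{\infty}p_i\,\delta(x - x_i)\right)dx \\
&= \int_{-\infty}^{\infty}x\,p_c(x)\,dx + \int_{-\infty}^{\infty}x\sum_{i=1}^{\infty}p_i\,\delta(x - x_i)\,dx \\
&= \int_{-\infty}^{\infty}x\,p_c(x)\,dx + \sum_{i=1}^{\infty}p_i\int_{-\infty}^{\infty}x\,\delta(x - x_i)\,dx \\
&= \int_{-\infty}^{\infty}x\,p_c(x)\,dx + \sum_{i=1}^{\infty}x_ip_i
\end{aligned}
\tag{11.11}
$$

since $\int_{-\infty}^{\infty}g(x)\,\delta(x - x_i)\,dx = g(x_i)$ for $g(x)$ a function
continuous at $x = x_i$. This is known as the sifting property of a Dirac delta function
(see Appendix D). A summary of the means for the important PDFs is given in Table 11.1.
Lastly, note that the expected value of a random variable can also be determined from
the CDF as shown in Problem 11.28.

| | Values | PDF | $E[X]$ | $\mathrm{var}(X)$ | $\phi_X(\omega)$ |
|---|---|---|---|---|---|
| Uniform | $a < x < b$ | $\frac{1}{b-a}$ | $\frac{1}{2}(a+b)$ | $\frac{(b-a)^2}{12}$ | $\frac{\exp(j\omega b)-\exp(j\omega a)}{j\omega(b-a)}$ |
| Exponential | $x \geq 0$ | $\lambda\exp(-\lambda x)$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ | $\frac{\lambda}{\lambda-j\omega}$ |
| Gaussian | $-\infty < x < \infty$ | $\frac{1}{\sqrt{2\pi\sigma^2}}\exp[-(1/(2\sigma^2))(x-\mu)^2]$ | $\mu$ | $\sigma^2$ | $\exp[j\omega\mu-\sigma^2\omega^2/2]$ |
| Laplacian | $-\infty < x < \infty$ | $\frac{1}{\sqrt{2\sigma^2}}\exp\left(-\sqrt{2/\sigma^2}\,\lvert x\rvert\right)$ | $0$ | $\sigma^2$ | $\frac{2/\sigma^2}{\omega^2+2/\sigma^2}$ |
| Gamma | $x \geq 0$ | $\frac{\lambda^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\lambda x)$ | $\frac{\alpha}{\lambda}$ | $\frac{\alpha}{\lambda^2}$ | $\frac{1}{(1-j\omega/\lambda)^{\alpha}}$ |
| Rayleigh | $x \geq 0$ | $\frac{x}{\sigma^2}\exp[-x^2/(2\sigma^2)]$ | $\sqrt{\frac{\pi\sigma^2}{2}}$ | $(2-\pi/2)\sigma^2$ | [Johnson et al 1994] |

Table 11.1: Properties of continuous random variables.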
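As an illustration of (11.11), consider the mixed random variable $Y = \max(0,X)$ with $X \sim \mathcal{N}(0,1)$ (the choice of a standard Gaussian $X$ is our assumption). Its PDF has a continuous part equal to the Gaussian PDF for $y > 0$ and an impulse of strength $1/2$ at $y = 0$, so that (11.11) gives $E[Y] = \int_0^\infty y\,(1/\sqrt{2\pi})\exp(-y^2/2)\,dy + 0\cdot(1/2) = 1/\sqrt{2\pi}$. A minimal MATLAB sketch comparing this value with a sample mean is:

```matlab
randn('state',0);                                 % reset generator for repeatability
M = 100000;                                       % number of realizations
y = max(0,randn(M,1));                            % mixed random variable Y = max(0,X), X ~ N(0,1) assumed
yc = @(t) t.*exp(-0.5*t.^2)/sqrt(2*pi);           % y*p_c(y), continuous part of the PDF for y > 0
EY_formula = quadgk(yc,0,Inf) + 0*0.5;            % (11.11): integral plus x_1*p_1 with x_1 = 0, p_1 = 1/2
EY_sim = mean(y);                                 % sample mean estimate of E[Y]
disp([EY_formula EY_sim 1/sqrt(2*pi)])
```

The impulse contributes nothing here only because it is located at $y = 0$; an impulse at any other location $x_i$ would add $x_ip_i$ to the mean.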


11.6 Variance and Moments of a Continuous Random Variable

The variance of a continuous random variable, as for a discrete random variable,
measures the average squared deviation from the mean. It is defined as
$\mathrm{var}(X) = E[(X - E[X])^2]$ (exactly the same as for a discrete random variable).
To evaluate the variance we use (11.10) to yield

$$\mathrm{var}(X) = \int_{-\infty}^{\infty}(x - E[X])^2\,p_X(x)\,dx. \tag{11.12}$$

As an example, consider a $\mathcal{N}(\mu,\sigma^2)$ random variable. In Figure 10.9 we saw
that the width of the PDF increases as $\sigma^2$ increases. This is because the parameter
$\sigma^2$ is actually the variance, as we now show. Using (11.12) and the definition of a
Gaussian PDF

$$
\begin{aligned}
\mathrm{var}(X) &= \int_{-\infty}^{\infty}(x - E[X])^2\frac{1}{\sqrt{2\pi\sigma^2}}
 \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]dx \\
&= \int_{-\infty}^{\infty}(x - \mu)^2\frac{1}{\sqrt{2\pi\sigma^2}}
 \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]dx
 \qquad \text{(recall that } E[X] = \mu\text{)}.
\end{aligned}
$$

Letting $u = (x - \mu)/\sigma$ produces (recall that $\sigma = \sqrt{\sigma^2} > 0$)

$$
\begin{aligned}
\mathrm{var}(X) &= \int_{-\infty}^{\infty}\sigma^2u^2\frac{1}{\sqrt{2\pi\sigma^2}}
 \exp\left[-\frac{1}{2}u^2\right]\sigma\,du \\
&= \sigma^2\underbrace{\int_{-\infty}^{\infty}u^2\frac{1}{\sqrt{2\pi}}
 \exp\left[-\frac{1}{2}u^2\right]du}_{=1\ \text{(see Example 11.3)}} \\
&= \sigma^2.
\end{aligned}
$$

Hence, we now know that a $\mathcal{N}(\mu,\sigma^2)$ random variable has a mean of $\mu$ and a
variance of $\sigma^2$.

It is common to refer to the square root of the variance as the standard deviation.
For a $\mathcal{N}(\mu,\sigma^2)$ random variable it is given by $\sigma$. The standard deviation
indicates how closely outcomes tend to cluster about the mean. (See Problem 11.29 for an
alternative interpretation.) Again if the random variable is $\mathcal{N}(\mu,\sigma^2)$, then
68.3% of the outcomes will be within the interval $[\mu - \sigma, \mu + \sigma]$, 95.4% will be
within $[\mu - 2\sigma, \mu + 2\sigma]$, and 99.7% will be within $[\mu - 3\sigma, \mu + 3\sigma]$. This is
illustrated in Figure 11.5. Of course, other PDFs will have concentrations that are
different for $E[X] \pm k\sqrt{\mathrm{var}(X)}$. Another example follows.

[Figure 11.5: Percentage of outcomes of $\mathcal{N}(1,1)$ random variable that are within
$k = 1$, 2, and 3 standard deviations from the mean. Shaded regions denote area
within interval $\mu - k\sigma \leq x \leq \mu + k\sigma$. (a) 68.3% for 1 standard deviation.
(b) 95.4% for 2 standard deviations. (c) 99.7% for 3 standard deviations.]
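The percentages for a Gaussian random variable follow from $P[\mu - k\sigma \leq X \leq \mu + k\sigma] = 1 - 2Q(k)$, which can be written in terms of the error function as $\mathrm{erf}(k/\sqrt{2})$. A short MATLAB sketch (variable names are illustrative) that evaluates this for $k = 1, 2, 3$ is:

```matlab
k = [1 2 3];                 % number of standard deviations
P = erf(k/sqrt(2));          % P[mu-k*sigma <= X <= mu+k*sigma] = 1-2Q(k) for a Gaussian X
disp(100*P)                  % percentages of outcomes within k standard deviations
```

The result does not depend on $\mu$ or $\sigma^2$, and evaluates to approximately 68.27%, 95.45%, and 99.73%.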

Example  11.5  —  Variance  of  a  uniform  random  variable 

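If $X \sim \mathcal{U}(a,b)$, then

$$
\begin{aligned}
\mathrm{var}(X) &= \int_{-\infty}^{\infty}(x - E[X])^2\,p_X(x)\,dx \\
&= \int_{a}^{b}\left(x - \frac{1}{2}(a+b)\right)^2\frac{1}{b-a}\,dx
\end{aligned}
$$

and letting $u = x - (a+b)/2$, we have

$$
\begin{aligned}
\mathrm{var}(X) &= \frac{1}{b-a}\int_{-(b-a)/2}^{(b-a)/2}u^2\,du \\
&= \frac{1}{b-a}\left.\frac{1}{3}u^3\right|_{-(b-a)/2}^{(b-a)/2} \\
&= \frac{(b-a)^2}{12}.
\end{aligned}
$$

❖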

A  summary  of  the  variances  for  the  important  PDFs  is  given  in  Table  11.1.  The 
variance  of  a  continuous  random  variable  enjoys  the  same  properties  as  for  a  discrete 
random  variable.  Recall  that  an  alternate  form  for  variance  computation  is 

$$\mathrm{var}(X) = E[X^2] - E^2[X]$$

and if $c$ is a constant then

$$
\begin{aligned}
\mathrm{var}(c) &= 0 \\
\mathrm{var}(X + c) &= \mathrm{var}(X) \\
\mathrm{var}(cX) &= c^2\,\mathrm{var}(X).
\end{aligned}
\tag{11.13}
$$

Also, the variance is a nonlinear type of operation in that

$$\mathrm{var}(g_1(X) + g_2(X)) \neq \mathrm{var}(g_1(X)) + \mathrm{var}(g_2(X))$$

(see Problem 11.32). Recall from the discussions for a discrete random variable that
$E[X]$ and $E[X^2]$ are termed the first and second moments, respectively. In general,
$E[X^n]$ is termed the $n$th moment and it is defined to exist if $E[|X|^n] < \infty$. If it
is known that $E[X^s]$ exists, then it can be shown that $E[X^r]$ exists for $r < s$ (see
Problem 6.23). This also says that if $E[X^r]$ is known not to exist, then $E[X^s]$ cannot
exist for $s > r$. An example is the Cauchy PDF for which we saw that $E[X]$ does
not exist and therefore all the higher order moments do not exist. In particular,
the Cauchy PDF does not have a second-order moment and therefore its variance
does not exist. We next give an example of the computation of all the moments of
a PDF.

Example  11.6  —  Moments  of  an  exponential  random  variable 

Using (11.10) we have for $X \sim \exp(\lambda)$ that

$$E[X^n] = \int_{0}^{\infty}x^n\lambda\exp(-\lambda x)\,dx.$$

To evaluate this we first show how the $n$th moment can be written recursively in terms
of the $(n-1)$st moment. Since we know that $E[X] = 1/\lambda$, we can then determine
all the moments using the recursion. We can begin to evaluate the integral using
integration by parts. This will yield the recursive formula for the moments. Letting
$U = x^n$ and $dV = \lambda\exp(-\lambda x)\,dx$ so that $dU = nx^{n-1}\,dx$ and $V = -\exp(-\lambda x)$, we
have

$$
\begin{aligned}
E[X^n] &= \left.-x^n\exp(-\lambda x)\right|_{0}^{\infty} - \int_{0}^{\infty}-\exp(-\lambda x)\,nx^{n-1}\,dx \\
&= 0 + n\int_{0}^{\infty}x^{n-1}\exp(-\lambda x)\,dx \\
&= \frac{n}{\lambda}\int_{0}^{\infty}x^{n-1}\lambda\exp(-\lambda x)\,dx \\
&= \frac{n}{\lambda}E[X^{n-1}].
\end{aligned}
$$

Hence, the $n$th moment can be written in terms of the $(n-1)$st moment. Since we
know that $E[X] = 1/\lambda$, we have upon using the recursion that

$$
\begin{aligned}
E[X^2] &= \frac{2}{\lambda}E[X] = \frac{2}{\lambda^2} \\
E[X^3] &= \frac{3}{\lambda}E[X^2] = \frac{3\cdot 2}{\lambda^3}
\end{aligned}
$$

etc. and in general

$$E[X^n] = \frac{n!}{\lambda^n}. \tag{11.14}$$

The variance can be found to be $\mathrm{var}(X) = 1/\lambda^2$ using these results.

❖ 
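The moments of the exponential PDF can be checked by integrating $x^n\lambda\exp(-\lambda x)$ numerically and comparing with $n!/\lambda^n$. The sketch below is illustrative only; the value $\lambda = 2$ and the range $n = 1,\ldots,4$ are arbitrary assumptions.

```matlab
lambda = 2;                                    % illustrative value of lambda (an assumption)
EXn = zeros(1,4);                              % storage for the first four moments
for n = 1:4
  f = @(x) x.^n.*lambda.*exp(-lambda*x);       % integrand of the n-th moment
  EXn(n) = quadgk(f,0,Inf);                    % numerical value of E[X^n]
end
closed = factorial(1:4)./lambda.^(1:4);        % closed form n!/lambda^n
varX = EXn(2)-EXn(1)^2;                        % var(X) = E[X^2]-E^2[X]
disp([EXn; closed])                            % the two rows should agree
disp([varX 1/lambda^2])
```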

In  the  next  section  we  will  see  how  to  use  characteristic  functions  to  simplify  the 
complicated  integration  process  required  for  moment  evaluation. 

Lastly, it is sometimes important to be able to compute moments about some
point. For example, the variance is the second moment about the point $E[X]$. In
general, the $n$th central moment about the point $E[X]$ is defined as $E[(X - E[X])^n]$.
The relationship between the moments and the central moments is of interest. For
$n = 2$ the central moment is related to the moments by the usual formula
$E[(X - E[X])^2] = E[X^2] - E^2[X]$. More generally, this relationship is found using the
binomial theorem as follows.

$$
\begin{aligned}
E[(X - E[X])^n] &= E\left[\sum_{k=0}^{n}\binom{n}{k}X^k(-E[X])^{n-k}\right] \\
&= \sum_{k=0}^{n}\binom{n}{k}E[X^k](-E[X])^{n-k}
 \qquad \text{(linearity of expectation operator)}
\end{aligned}
$$

or finally we have that

$$E[(X - E[X])^n] = \sum_{k=0}^{n}\binom{n}{k}(-1)^{n-k}(E[X])^{n-k}E[X^k]. \tag{11.15}$$
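As a worked instance of this relationship (added here for illustration), setting $n = 3$ and writing out the four terms $k = 0, 1, 2, 3$ gives

```latex
\begin{aligned}
E[(X - E[X])^3] &= -(E[X])^3 + 3(E[X])^2E[X] - 3E[X]\,E[X^2] + E[X^3] \\
                &= E[X^3] - 3E[X]\,E[X^2] + 2(E[X])^3 .
\end{aligned}
```

For an exponential random variable, whose moments are $E[X] = 1/\lambda$, $E[X^2] = 2/\lambda^2$, and $E[X^3] = 6/\lambda^3$, this yields a third central moment of $6/\lambda^3 - 6/\lambda^3 + 2/\lambda^3 = 2/\lambda^3$.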


11.7  Characteristic  Functions 


As first introduced for discrete random variables, the characteristic function is a
valuable tool for the calculation of moments. It is defined as

$$\phi_X(\omega) = E[\exp(j\omega X)] \tag{11.16}$$

and always exists (even though the moments of a PDF may not). For a continuous
random variable it is evaluated using (11.10) for the real and imaginary parts of
$E[\exp(j\omega X)]$, which are $E[\cos(\omega X)]$ and $E[\sin(\omega X)]$. This results in

$$\phi_X(\omega) = \int_{-\infty}^{\infty}\exp(j\omega x)\,p_X(x)\,dx$$

or in more familiar form as

$$\phi_X(\omega) = \int_{-\infty}^{\infty}p_X(x)\exp(j\omega x)\,dx. \tag{11.17}$$

The characteristic function is seen to be the Fourier transform of the PDF, although
with a $+j$ in the definition as opposed to the more common $-j$. Once the
characteristic function has been found, the moments are given as

$$E[X^n] = \frac{1}{j^n}\left.\frac{d^n\phi_X(\omega)}{d\omega^n}\right|_{\omega=0}. \tag{11.18}$$

An example follows.

Example  11.7  —  Moments  of  the  exponential  PDF 

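Using the definition of the exponential PDF (see (10.5)) we have

$$
\begin{aligned}
\phi_X(\omega) &= \int_{0}^{\infty}\lambda\exp(-\lambda x)\exp(j\omega x)\,dx \\
&= \int_{0}^{\infty}\lambda\exp[-(\lambda - j\omega)x]\,dx \\
&= \lambda\left.\frac{\exp[-(\lambda - j\omega)x]}{-(\lambda - j\omega)}\right|_{0}^{\infty} \\
&= -\frac{\lambda}{\lambda - j\omega}\left(\exp[-(\lambda - j\omega)\infty] - 1\right).
\end{aligned}
$$

But $\exp[-(\lambda - j\omega)x] \to 0$ as $x \to \infty$ since $\lambda > 0$ and hence we have

$$\phi_X(\omega) = \frac{\lambda}{\lambda - j\omega}. \tag{11.19}$$

To find the moments using (11.18) we need to differentiate the characteristic function
$n$ times. Proceeding to do so

$$
\begin{aligned}
\frac{d\phi_X(\omega)}{d\omega} &= \frac{d}{d\omega}\lambda(\lambda - j\omega)^{-1}
 = \lambda(-1)(\lambda - j\omega)^{-2}(-j) \\
\frac{d^2\phi_X(\omega)}{d\omega^2} &= \lambda(-1)(-2)(\lambda - j\omega)^{-3}(-j)^2 \\
&\;\;\vdots \\
\frac{d^n\phi_X(\omega)}{d\omega^n} &= \lambda(-1)(-2)\cdots(-n)(\lambda - j\omega)^{-n-1}(-j)^n \\
&= \lambda j^n n!\,(\lambda - j\omega)^{-n-1}
\end{aligned}
$$

and therefore

$$
\begin{aligned}
E[X^n] &= \frac{1}{j^n}\left.\frac{d^n\phi_X(\omega)}{d\omega^n}\right|_{\omega=0} \\
&= \left.\lambda n!\,(\lambda - j\omega)^{-n-1}\right|_{\omega=0} \\
&= \frac{n!}{\lambda^n}
\end{aligned}
$$

which agrees with our earlier results (see (11.14)).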
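The moment formula (11.18) can also be checked numerically by replacing the derivatives of $\phi_X(\omega) = \lambda/(\lambda - j\omega)$ at $\omega = 0$ with finite differences. This is only a sketch; the value $\lambda = 2$ and the step size $h$ are arbitrary assumptions, and central differences are our choice of approximation rather than anything prescribed by the text.

```matlab
lambda = 2;                               % illustrative value of lambda (an assumption)
phi = @(w) lambda./(lambda-1j*w);         % characteristic function of the exponential PDF
h = 1e-3;                                 % finite-difference step size (an assumption)
d1 = (phi(h)-phi(-h))/(2*h);              % central-difference approximation of the first derivative at w = 0
d2 = (phi(h)-2*phi(0)+phi(-h))/h^2;       % central-difference approximation of the second derivative at w = 0
EX  = real(d1/1j);                        % (11.18) with n = 1
EX2 = real(d2/(1j)^2);                    % (11.18) with n = 2
disp([EX 1/lambda; EX2 2/lambda^2])       % compare with n!/lambda^n
```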


Moment  formula  only  valid  if  moments  exist 


Just because a PDF has a characteristic function, and all do, does not mean that
(11.18) can be applied. For example, the Cauchy PDF has the characteristic function
(see Problem 11.40)

$$\phi_X(\omega) = \exp(-|\omega|)$$

(although the derivative does not exist at $\omega = 0$). However, as we have already
seen, the mean does not exist and hence all higher order moments also do not exist.
Thus, no moments exist at all for the Cauchy PDF.

△

The characteristic function has nearly the same properties as for a discrete random
variable, namely

1. The characteristic function always exists.

2. The PDF can be recovered from the characteristic function by the inverse Fourier
transform, which in this case is

$$p_X(x) = \int_{-\infty}^{\infty}\phi_X(\omega)\exp(-j\omega x)\,\frac{d\omega}{2\pi}. \tag{11.20}$$

3. Convergence of a sequence of characteristic functions $\phi_X^{(n)}(\omega)$ for
$n = 1,2,\ldots$ to a given characteristic function $\phi(\omega)$ guarantees that the
corresponding sequence of PDFs $p_X^{(n)}(x)$ for $n = 1,2,\ldots$ converges to $p(x)$,
where from (11.20)

$$p(x) = \int_{-\infty}^{\infty}\phi(\omega)\exp(-j\omega x)\,\frac{d\omega}{2\pi}.$$

(See Problem 11.42 for an example.) This property is also essential for proving
the central limit theorem described in Chapter 15.

A slight difference from the characteristic function of a discrete random variable
is that $\phi_X(\omega)$ is now not periodic in $\omega$. It does, however, have the usual
properties of the continuous-time Fourier transform [Jackson 1991]. A summary of the
characteristic functions for the important PDFs is given in Table 11.1.


11.8 Probability, Moments, and the Chebyshev Inequality

The mean and variance of a random variable indicate the average value and variability
of the outcomes of a repeated experiment. As such, they summarize important
information about the PDF. However, they are not sufficient to determine
probabilities of events. For example, the PDFs

$$
\begin{aligned}
p_X(x) &= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right) && \text{(Gaussian)} \\
p_X(x) &= \frac{1}{\sqrt{2}}\exp\left(-\sqrt{2}\,|x|\right) && \text{(Laplacian)}
\end{aligned}
$$

both have $E[X] = 0$ (due to symmetry about $x = 0$) and $\mathrm{var}(X) = 1$. Yet, the
probability of a given interval can be very different. Although the relationship
between the mean and variance, and the probability of an event is not a direct one,
we can still obtain some information about the probabilities based on the mean and
variance. In particular, it is possible to bound the probability or to be able to assert
that

$$P[|X - E[X]| > \gamma] \leq B$$

where $B$ is a number less than one. This is especially useful if we only wish to
make sure the probability is below a certain value, without explicitly having to find
the probability. For example, if the probability of a speech signal of mean 0 and
variance 1 exceeding a given magnitude $\gamma$ (see Section 10.10) is to be no more than
1%, then we would be satisfied if we could determine a $\gamma$ so that

$$P[|X - E[X]| > \gamma] \leq 0.01.$$

We now show that the probability for the event $|X - E[X]| > \gamma$ can be bounded if
we know the mean and variance. Computation of the probability is not required and
therefore the PDF does not need to be known. Estimating the mean and variance is
much easier than the entire PDF (see Section 11.9). The inequality to be developed
is called the Chebyshev inequality. Using the definition of the variance we have

$$
\begin{aligned}
\mathrm{var}(X) &= \int_{-\infty}^{\infty}(x - E[X])^2p_X(x)\,dx \\
&= \int_{\{x:|x - E[X]|>\gamma\}}(x - E[X])^2p_X(x)\,dx
 + \int_{\{x:|x - E[X]|\leq\gamma\}}(x - E[X])^2p_X(x)\,dx \\
&\geq \int_{\{x:|x - E[X]|>\gamma\}}(x - E[X])^2p_X(x)\,dx
 && \text{(omitted integral is nonnegative)} \\
&\geq \int_{\{x:|x - E[X]|>\gamma\}}\gamma^2p_X(x)\,dx
 && \text{(since for each } x,\ |x - E[X]| > \gamma\text{)} \\
&= \gamma^2\int_{\{x:|x - E[X]|>\gamma\}}p_X(x)\,dx \\
&= \gamma^2P[|X - E[X]| > \gamma]
\end{aligned}
$$

so that we have the Chebyshev inequality

$$P[|X - E[X]| > \gamma] \leq \frac{\mathrm{var}(X)}{\gamma^2}. \tag{11.21}$$

Hence, the probability that a random variable deviates from its mean by more
than $\gamma$ (in either direction) is less than or equal to $\mathrm{var}(X)/\gamma^2$. This agrees
with our intuition in that the probability of an outcome departing from the mean must
become smaller as the width of the PDF decreases or equivalently as the variance
decreases. An example follows.

Example  11.8  —  Bounds  for  different  PDFs 

Assuming $E[X] = 0$ and $\mathrm{var}(X) = 1$, we have from (11.21)

$$P[|X| > \gamma] \leq \frac{1}{\gamma^2}.$$

If $\gamma = 3$, then we have that $P[|X| > 3] \leq 1/9 \approx 0.11$. This is a rather “loose”
bound in that if $X \sim \mathcal{N}(0,1)$, then the actual value of this probability is
$P[|X| > 3] = 2Q(3) = 0.0027$. Hence, the actual probability is indeed less than or equal
to the bound of 0.11, but quite a bit less. In the case of a Laplacian random variable
with mean 0 and variance 1, the bound is the same but the actual value is now

$$
\begin{aligned}
P[|X| > 3] &= \int_{-\infty}^{-3}\frac{1}{\sqrt{2}}\exp\left(-\sqrt{2}\,|x|\right)dx
 + \int_{3}^{\infty}\frac{1}{\sqrt{2}}\exp\left(-\sqrt{2}\,|x|\right)dx \\
&= 2\int_{3}^{\infty}\frac{1}{\sqrt{2}}\exp\left(-\sqrt{2}\,x\right)dx
 \qquad \text{(PDF is symmetric about } x = 0\text{)} \\
&= \left.-\exp\left(-\sqrt{2}\,x\right)\right|_{3}^{\infty} \\
&= \exp\left(-3\sqrt{2}\right) = 0.0144.
\end{aligned}
$$

Once again the bound is seen to be correct but provides a gross overestimation of
the probability. A graph of the Chebyshev bound as well as the actual probabilities
of $P[|X| > \gamma]$ versus $\gamma$ is shown in Figure 11.6. The reader may also wish to consider
what would happen if we used the Chebyshev inequality to bound $P[|X| > 0.5]$ if
$X \sim \mathcal{N}(0,1)$.

[Figure 11.6: Probabilities $P[|X| > \gamma]$ for Gaussian and Laplacian random variables
with zero mean and unity variance compared to Chebyshev inequality.]

❖
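The comparison between the Chebyshev bound and the actual probabilities can be tabulated with a few lines of MATLAB. The sketch below is illustrative and is not the program behind Figure 11.6; the thresholds in `gam` are arbitrary choices, and $Q(\cdot)$ is computed from the complementary error function.

```matlab
gam = [0.5 1 2 3];                    % thresholds gamma (illustrative values)
Q = @(z) 0.5*erfc(z/sqrt(2));         % right-tail probability of a N(0,1) random variable
bound = 1./gam.^2;                    % Chebyshev bound var(X)/gamma^2 with var(X) = 1
Pgauss = 2*Q(gam);                    % P[|X| > gamma] for X ~ N(0,1)
Plap = exp(-sqrt(2)*gam);             % P[|X| > gamma] for the unit-variance Laplacian PDF
disp([gam; bound; Pgauss; Plap])      % one column per threshold
```

Both actual probabilities lie below the bound at every threshold. For $\gamma < 1$ the bound exceeds one and therefore conveys no information.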


11.9  Estimating  the  Mean  and  Variance 

The mean and variance of a continuous random variable are estimated in exactly
the same way as for a discrete random variable (see Section 6.8). Assuming that we
have the $M$ outcomes $\{x_1, x_2, \ldots, x_M\}$ of a random variable $X$ the mean estimate
is

$$\widehat{E[X]} = \frac{1}{M}\sum_{i=1}^{M}x_i \tag{11.22}$$


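and the variance estimate is

$$
\widehat{\mathrm{var}(X)} = \frac{1}{M}\sum_{i=1}^{M}x_i^2 - \left(\frac{1}{M}\sum_{i=1}^{M}x_i\right)^2.
\tag{11.23}
$$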


An  example  of  the  use  of  (11.22)  was  given  in  Example  2.6  for  a  N(0, 1)  random 
variable.  Some  practice  with  the  estimation  of  the  mean  and  variance  is  provided 
in  Problem  11.46. 
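A minimal MATLAB sketch of these estimates is given below for $\mathcal{N}(0,1)$ outcomes. The variance estimate is computed as the sample second moment minus the squared sample mean, which mirrors $\mathrm{var}(X) = E[X^2] - E^2[X]$; this form, the number of outcomes, and the variable names are assumptions made for the illustration.

```matlab
randn('state',0);                    % reset generator for repeatability
M = 10000;                           % number of outcomes (an arbitrary choice)
x = randn(M,1);                      % outcomes of X ~ N(0,1)
EXest = sum(x)/M;                    % mean estimate: sample mean of the outcomes
varest = sum(x.^2)/M - EXest^2;      % variance estimate: sample second moment minus squared sample mean
disp([EXest varest])                 % should be close to 0 and 1
```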


11.10 Real-World Example - Critical Software Testing Using Importance Sampling

Computer  software  is  a  critical  component  of  nearly  every  device  used  today.  The 
failure  of  such  software  can  range  from  being  an  annoyance,  as  in  the  outage  of  a 
cellular  telephone,  to  being  a  catastrophe,  as  in  the  breakdown  of  the  control  system 
for  a  nuclear  power  plant.  Testing  of  software  is  of  course  a  prerequisite  for  reliable 
operation,  but  some  events,  although  potentially  catastrophic,  will  (hopefully)  occur 
only  rarely.  Therefore,  the  question  naturally  arises  as  to  how  to  test  software  that  is 
designed to fail only once every $10^7$ hours ($\approx 1140$ years). In other words, although
a theoretical analysis might predict such a low failure rate, there is no way to test
the software by running it and waiting for a failure. A technique that is often used in
other fields to test a system is to “stress” the system to induce more frequent failures,
say by a factor of $10^5$, then estimate the probability of failure per hour, and finally
readjust  the  probability  for  the  increased  stress  factor.  An  analogous  approach 
can  be  used  for  highly  reliable  software  if  we  can  induce  a  higher  failure  rate  and 
then  readjust  our  failure  probability  estimate  by  the  increased  factor.  A  proposed 
method  to  do  this  is  to  stress  the  software  to  cause  the  probability  of  a  failure  to 
increase  [Hecht  and  Hecht  2000].  Conceivably  we  could  do  this  by  inputting  data 
to  the  software  that  is  suspected  to  cause  failures  but  at  a  much  higher  rate  than  is 
normally  encountered  in  practice.  This  means  that  if  T  is  the  time  to  failure,  then 
we would like to replace the PDF of $T$ so that $P[T > \gamma]$ increases by a significant
factor. Then, after estimating this probability by exercising the software we could
adjust the estimate back to the original unstressed value. This probabilistic approach
is  called  importance  sampling  [Rubinstein  1981]. 

As an example of the use of importance sampling, assume that $X$ is a continuous
random variable and we wish to estimate $P[X > \gamma]$. As usual, we could generate
realizations of $X$, count the number that exceed $\gamma$, and then divide this by the
total number of realizations. But what if the probability sought is $10^{-7}$? Then we
would need about $10^9$ realizations to do this. As a specific example, suppose that
$X \sim \mathcal{N}(0,1)$, although in practice we would not have knowledge of the PDF at
our disposal, and that we wish to estimate $P[X > 5]$ based on observed realization
values. The true probability is known to be $Q(5) = 2.86 \times 10^{-7}$. The importance
sampling approach first recognizes that the desired probability is given by

$$I = \int_{5}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right)dx$$

and is equivalent to

$$I = \int_{5}^{\infty}\frac{\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right)}{p_{X'}(x)}\,p_{X'}(x)\,dx$$

where $p_{X'}(x)$ is a more suitable PDF. By “more suitable” we mean that its
probability of $X' > 5$ is larger, and therefore, generating realizations based on it will
produce more occurrences of the desired event. One possibility is $X' \sim \exp(1)$ or
$p_{X'}(x) = \exp(-x)u(x)$ for which $P[X' > 5] = \exp(-5) = 0.0067$. Using this new
PDF we have the desired probability

$$I = \int_{5}^{\infty}\frac{\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right)}{\exp(-x)}\,\exp(-x)\,dx$$

or using the indicator function, this can be written as

$$
I = \int_{0}^{\infty}\underbrace{I_{(5,\infty)}(x)\frac{1}{\sqrt{2\pi}}
 \exp\left(-\frac{1}{2}x^2 + x\right)}_{g(x)}\,p_{X'}(x)\,dx.
$$

Now the desired probability can be interpreted as $E[g(X')]$, where $X' \sim \exp(1)$. To
estimate it using a Monte Carlo computer simulation we first generate $M$ realizations
of an $\exp(1)$ random variable and then use as our estimate

$$
\hat{I} = \frac{1}{M}\sum_{i=1}^{M}I_{(5,\infty)}(x_i)
 \underbrace{\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x_i^2 + x_i\right)}_{\text{weight with value } \ll 1 \text{ for } x_i > 5}.
\tag{11.24}
$$

The advantage of the importance sampling approach is that the realizations whose
values exceed 5, which are the ones contributing to the sum, are much more probable.
In fact, as we have noted $P[X' > 5] = 0.0067$ and therefore with $M = 10,000$
realizations we would expect about 67 realizations to contribute to the sum. Contrast
this with a $\mathcal{N}(0,1)$ random variable for which we would expect
$MQ(5) = (10^4)(2.86 \times 10^{-7}) \approx 0$ realizations to exceed 5. The new PDF $p_{X'}$ is
called the importance function and hence the generation of realizations from this PDF,
which is also called sampling from the PDF, is termed importance sampling. As seen from
(11.24), its success requires a weighting factor that downweights the counting of
threshold exceedances.

In software testing the portions of software that are critical to the operation of
the overall system would be exercised more often than in normal operation, thus
effectively replacing the operational PDF or $p_X$ by the importance function PDF
or $p_{X'}$. The ratio of these two would be needed as seen in (11.24) to adjust the
weight for each incidence of a failure. This ratio would also need to be estimated in
practice. In this way a good estimate of the probability of failure could be obtained
by exercising the software a reasonable number of times with different inputs.
Otherwise, the critical software might not exhibit a failure a sufficient number of times
to estimate its probability.

As a numerical example, if $X' \sim \exp(1)$, we can generate realizations using the
inverse probability transformation method (see Section 10.9) via $X' = -\ln(1 - U)$,
where $U \sim \mathcal{U}(0,1)$. A MATLAB computer program to estimate $I$ is given below.


    rand('state',0)          % sets random number generator to initial value
    M=10000; gamma=5;        % change M for different estimates
    u=rand(M,1);             % generates M U(0,1) realizations
    x=-log(1-u);             % generates M exp(1) realizations
    k=0;                     % counts the number of exceedances
    y=[];                    % weights (stays empty if gamma is never exceeded)
    for i=1:M                % computes estimate of P[X>gamma]
       if x(i)>gamma
          k=k+1;
          y(k,1)=(1/sqrt(2*pi))*exp(-0.5*x(i)^2+x(i)); % computes weights for estimate
       end
    end
    Qest=sum(y)/M            % final estimate of P[X>gamma]


The  results  are  summarized  in  Table  11.2  for  different  values  of  M,  along  with  the 
true  value  of  Q{ 5).  Also  shown  are  the  number  of  times  7  was  exceeded.  Without 
the  use  of  importance  sampling  the  number  of  exceedances  would  be  expected  to  be 
MQ( 5)  «  0  in  all  cases. 
M         Estimated P[X > 5]        True P[X > 5]             Exceedances
$10^3$    $1.11 \times 10^{-7}$     $2.86 \times 10^{-7}$     4
$10^4$    $2.96 \times 10^{-7}$     $2.86 \times 10^{-7}$     66
$10^5$    $2.51 \times 10^{-7}$     $2.86 \times 10^{-7}$     630
$10^6$    $2.87 \times 10^{-7}$     $2.86 \times 10^{-7}$     6751

Table 11.2: Importance sampling approach to estimation of small probabilities.
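
As a cross-check on the procedure summarized in Table 11.2, the estimate can be computed in vectorized form and compared with $Q(5)$ obtained from the complementary error function, using the identity $Q(\gamma) = \frac{1}{2}\mathrm{erfc}(\gamma/\sqrt{2})$. The following is only a minimal sketch under stated assumptions: the variable names `w`, `Qtrue`, and `nExceed` are introduced here for illustration, the value of `M` is an arbitrary choice, and the numbers it produces depend on the random number generator, so they need not match the table entries.

```matlab
% Vectorized importance-sampling estimate of P[X > gamma] for X ~ N(0,1),
% drawing the samples from the exp(1) importance function PDF.
rand('state',0);                          % same legacy seeding call as the listing in the text
M = 1e5; gamma = 5;                       % M is an illustrative choice
u = rand(M,1);                            % M U(0,1) realizations
x = -log(1-u);                            % M exp(1) realizations (inverse CDF)
w = (1/sqrt(2*pi))*exp(-0.5*x.^2 + x);    % weights p_X(x)/p_X'(x) for x > 0
Qest = sum(w(x > gamma))/M;               % weighted count of exceedances
Qtrue = 0.5*erfc(gamma/sqrt(2));          % Q(gamma) via the complementary error function
nExceed = sum(x > gamma);                 % number of times gamma was exceeded
fprintf('Qest = %.3e, Qtrue = %.3e, exceedances = %d\n', Qest, Qtrue, nExceed);
```

With exp(1) samples the expected number of exceedances is $M e^{-5}$, which is why a usable number of them occurs even for moderate $M$.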
Problems

11.1 (☺) (f) The block shown in Figure 11.7 has a mass of 1 kg. Find the center of mass for the block, which is the point along the x-axis where the block could be balanced (in practice the point would also be situated in the depth direction at 1/2).


Figure  11.7:  Block  for  Problem  11.1. 
11.2 (t) Prove that if the PDF is symmetric about a point $x = a$, which is to say that it satisfies $p_X(a + u) = p_X(a - u)$ for all $-\infty < u < \infty$, then the mean will be $a$. Hint: Write the integral $\int_{-\infty}^{\infty} x p_X(x)\,dx$ as $\int_{-\infty}^{a} x p_X(x)\,dx + \int_{a}^{\infty} x p_X(x)\,dx$ and then let $u = x - a$ in the first integral and $u = a - x$ in the second integral.

11.3 (c) Generate and plot 50 realizations of a Cauchy random variable. Do so by using the inverse probability integral transformation method. You should be able to show that $X = \tan(\pi(U - 1/2))$, where $U \sim \mathcal{U}(0,1)$, will generate the Cauchy realizations.

11.4  (c)  In  this  problem  we  show  via  a  computer  simulation  that  the  mean  of 
a  truncated  Cauchy  PDF  exists  and  is  equal  to  zero.  A  truncated  Cauchy 
random variable is one in which the realizations of a Cauchy PDF are set to $x = x_{\max}$ if $x > x_{\max}$ and $x = -x_{\max}$ if $x < -x_{\max}$. Generate realizations of this random variable with $x_{\max} = 50$ and plot the sample mean versus the
number  of  realizations.  What  does  the  sample  mean  converge  to? 

11.5  (t)  Prove  that  the  best  prediction  of  the  outcome  of  a  continuous  random 
variable  is  its  mean.  Best  is  to  be  interpreted  as  the  value  that  minimizes  the 
mean square error $\mathrm{mse}(b) = E[(X - b)^2]$.

11.6 (t) An even function is one for which $g(-x) = g(x)$, as for example $\cos(x)$. An odd function is one for which $g(-x) = -g(x)$, as for example $\sin(x)$. First prove that $\int_{-\infty}^{\infty} g(x)\,dx = 2\int_{0}^{\infty} g(x)\,dx$ if $g(x)$ is even and that $\int_{-\infty}^{\infty} g(x)\,dx = 0$ if $g(x)$ is odd. Next, prove that if $p_X(x)$ is even, then $E[X] = 0$ and also that $\int_{0}^{\infty} p_X(x)\,dx = 1/2$.

11.7 (f) Many integrals encountered in probability can be evaluated using integration by parts. This useful formula is

$$\int U\,dV = UV - \int V\,dU$$

where $U$ and $V$ are functions of $x$. As an example, if we wish to evaluate $\int x \exp(ax)\,dx$, we let $U = x$ and $dV = \exp(ax)\,dx$. The function $U$ is easily differentiated to yield $dU = dx$ and the differential $dV$ is easily integrated to yield $V = (1/a)\exp(ax)$. Continue the derivation to determine the integral of the function $x\exp(ax)$.

11.8  (f)  Find  the  mean  for  a  uniform  PDF.  Do  so  by  first  using  the  definition  and 
then  rederive  it  using  the  results  of  Problem  11.2. 

11.9 (t) Consider a continuous random variable that can take on values $x_{\min} \le x \le x_{\max}$. Prove that the expected value of this random variable must satisfy $x_{\min} \le E[X] \le x_{\max}$. Hint: Use the fact that if $M_1 \le g(x) \le M_2$, then $M_1(b - a) \le \int_a^b g(x)\,dx \le M_2(b - a)$.
11.10 (☺) (w) The signal-to-noise ratio (SNR) of a random variable quantifies the accuracy of a measurement of a physical quantity. It is defined as $E^2[X]/\mathrm{var}(X)$ and is seen to increase as the mean, which represents the true value, increases and also as the variance, which represents the power of the measurement error, i.e., $X - E[X]$, decreases. For example, if $X \sim \mathcal{N}(\mu, \sigma^2)$, then $\mathrm{SNR} = \mu^2/\sigma^2$. Determine the SNR if the measurement is $X = A + U$, where $A$ is the true value and $U$ is the measurement error with $U \sim \mathcal{U}(-1/2, 1/2)$. For an SNR of 1000 what should $A$ be?

11.11 (☺) (w) A toaster oven has a failure time that has an exponential PDF. If
the  mean  time  to  failure  is  1000  hours,  what  is  the  probability  that  it  will  not 
fail  for  at  least  2000  hours? 

11.12  (w)  A  bus  always  arrives  late.  On  the  average  it  is  10  minutes  late.  If  the 
lateness  time  is  an  exponential  random  variable,  determine  the  probability 
that  the  bus  will  be  less  than  1  minute  late. 

11.13  (w)  In  Section  1.3  we  described  the  amount  of  time  an  office  worker  spends 
on  the  phone  in  a  10-minute  period.  From  Figure  1.5  what  is  the  average 
amount  of  time  he  spends  on  the  phone? 

11.14 (☺) (f) Determine the mean of a $\chi_N^2$ PDF. See Chapter 10 for the definition of this PDF.

11.15  (f)  Determine  the  mean  of  an  Erlang  PDF  using  the  definition  of  expected 
value.  See  Chapter  10  for  the  definition  of  this  PDF. 

11.16  (f)  Determine  the  mean  of  a  Rayleigh  PDF  using  the  definition  of  expected 
value.  See  Chapter  10  for  the  definition  of  this  PDF. 

11.17  (w)  The  mode  of  a  PDF  is  the  value  of  x  for  which  the  PDF  is  maximum.  It 
can  be  thought  of  as  the  most  probable  value  of  a  random  variable  (actually 
most  probable  small  interval).  Find  the  mode  for  a  Gaussian  PDF  and  a 
Rayleigh  PDF.  How  do  they  relate  to  the  mean? 

11.18  (f)  Indicate  on  the  PDFs  shown  in  Figures  10.7-10.13  the  location  of  the 
mean  value. 

11.19 (☺) (w) A dart is thrown at a circular dartboard. If the distance from the
bullseye  is  a  Rayleigh  random  variable  with  a  mean  value  of  10,  what  is  the 
probability  that  the  dart  will  land  within  1  unit  of  the  bullseye? 

11.20  (f)  For  the  random  variables  described  in  Problems  2.8-2.11  what  are  the 
means? Note that the uniform random variable is $\mathcal{U}(0,1)$ and the Gaussian random variable is $\mathcal{N}(0,1)$.
11.21 (☺) (w) In Problem 2.14 it was asked whether the mean of $\sqrt{U}$, where $U \sim \mathcal{U}(0,1)$, is equal to $\sqrt{\text{mean of } U}$. There we relied on a computer simulation to answer the question. Now prove or disprove this equivalence.

11.22 (☺) (w) A sinusoidal oscillator outputs a waveform $s(t) = \cos(2\pi F_0 t + \phi)$, where $t$ indicates time, $F_0$ is the frequency in Hz, and $\phi$ is a phase angle that varies depending upon when the oscillator is turned on. If the phase is modeled as a random variable with $\Phi \sim \mathcal{U}(0, 2\pi)$, determine the average value of $s(t)$ for a given $t = t_0$. Also, determine the average power, which is defined as $E[s^2(t)]$ for a given $t = t_0$. Does this make sense? Explain your results.

11.23 (f) Determine $E[X^2]$ for a $\mathcal{N}(\mu, \sigma^2)$ random variable.

11.24 (f) Determine $E[(2X + 1)^2]$ for a $\mathcal{N}(\mu, \sigma^2)$ random variable.

11.25 (f) Determine the mean and variance for the indicator random variable $I_A(X)$ as a function of $P[A]$.

11.26 (☺) (w) A half-wave rectifier passes a zero or positive voltage undisturbed but blocks any negative voltage by outputting a zero voltage. If a noise sample with PDF $\mathcal{N}(0, \sigma^2)$ is input to a half-wave rectifier, what is the average power at the output? Explain your result.

11.27 (☺) (w) A mixed PDF is given as

$$p_X(x) = \frac{1}{2}\delta(x) + \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}x^2\right) u(x).$$

What is $E[X^2]$ for this PDF? Can this PDF be interpreted physically? Hint: See Problem 11.26.

11.28 (t) In this problem we derive an alternative formula for the mean of a nonnegative random variable. A more general formula exists for random variables that can take on both positive and negative values [Parzen 1960]. If $X$ can only take on values $x \ge 0$, then

$$E[X] = \int_0^{\infty} (1 - F_X(x))\,dx.$$

First verify that this formula holds for $X \sim \exp(\lambda)$. To prove that the formula is true in general, we use integration by parts (see Problem 11.7) as follows.

$$E[X] = \int_0^{\infty} (1 - F_X(x))\,dx = \int_0^{\infty} \underbrace{\int_x^{\infty} p_X(t)\,dt}_{U}\; \underbrace{dx}_{dV}.$$

Finish the proof by using $\lim_{x \to \infty} x \int_x^{\infty} p_X(t)\,dt = 0$, which must be true if the expected value exists (see if this holds for $X \sim \exp(\lambda)$).

11.29 (t) The standard deviation $\sigma$ of a Gaussian PDF can be interpreted as the distance from the mean at which the PDF curve goes through an inflection point. This means that at the points $x = \mu \pm \sigma$ the second derivative of $p_X(x)$ is zero. The curve then changes from being concave (shaped like a ∩) to being convex (shaped like a ∪). Show that the second derivative is zero at these points.

11.30 (☺) (w) The office worker described in Section 1.3 will spend an average of 7 minutes on the phone in any 10-minute interval. However, the probability that he will spend exactly 7 minutes on the phone is zero since the length of this interval is zero. If we wish to assert that he will spend between $T_{\min}$ and $T_{\max}$ minutes on the phone 95% of the time, what should $T_{\min}$ and $T_{\max}$ be? Hint: There are multiple solutions - choose any convenient one.

11.31 (w) A group of students is found to weigh an average of 150 lbs. with a standard deviation of 30 lbs. If we assume a normal population (in the probabilistic sense!) of students, what is the range of weights for which approximately 99.8% of the students will lie? Hint: There are multiple solutions - choose any convenient one.

11.32 (w) Provide a counterexample to disprove that $\mathrm{var}(g_1(X) + g_2(X)) = \mathrm{var}(g_1(X)) + \mathrm{var}(g_2(X))$ in general.

11.33 (w) The SNR of a random variable was defined in Problem 11.10. Determine the SNR for an exponential random variable and explain why it doesn't increase as the mean increases. Compare your results to a $\mathcal{N}(\mu, \sigma^2)$ random variable and explain.

11.34  (f)  Verify  the  mean  and  variance  for  a  Laplacian  random  variable  given  in 
Table  11.1. 

11.35 (☺) (f) Determine $E[X^3]$ if $X \sim \mathcal{N}(\mu, \sigma^2)$. Next find the third central moment.


11.36 (f) An example of a Gaussian mixture PDF is

$$p_X(x) = \frac{1}{2}\frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x - 1)^2\right] + \frac{1}{2}\frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x + 1)^2\right].$$

Determine its mean and variance.


11.37  (t)  Prove  that  if  a  PDF  is  symmetric  about  x  =  0,  then  all  its  odd-order 
moments  are  zero. 
11.38 (☺) (f) For a Laplacian PDF with $\sigma^2 = 2$ determine all the moments. Hint: Let

$$\frac{1}{\omega^2 + 1} = \frac{1}{2j}\left(\frac{1}{\omega - j} - \frac{1}{\omega + j}\right).$$


11.39 (f) If $X \sim \mathcal{N}(0, \sigma^2)$, determine $E[X^2]$ using the characteristic function approach.

11.40 (t) To determine the characteristic function of a Cauchy random variable we must evaluate the integral

$$\int_{-\infty}^{\infty} \frac{1}{\pi(1 + x^2)} \exp(j\omega x)\,dx.$$

A result from Fourier transform theory called the duality theorem asserts that the Fourier transform and inverse Fourier transform are nearly the same if we replace $x$ by $\omega$ and $\omega$ by $x$. As an example, for a Laplacian PDF with $\sigma^2 = 2$ we have from Table 11.1 that

$$\int_{-\infty}^{\infty} p_X(x)\exp(j\omega x)\,dx = \int_{-\infty}^{\infty} \frac{1}{2}\exp(-|x|)\exp(j\omega x)\,dx = \frac{1}{1 + \omega^2}.$$

The inverse Fourier transform relationship is therefore

$$\int_{-\infty}^{\infty} \frac{1}{1 + \omega^2} \exp(-j\omega x)\,\frac{d\omega}{2\pi} = \frac{1}{2}\exp(-|x|).$$

Use the latter integral, with appropriate modifications (note that $x$ and $\omega$ are just variables which we can redefine as desired), to obtain the characteristic function of a Cauchy random variable.


11.41 (f) If the characteristic function of a random variable is

$$\phi_X(\omega) = \left(\frac{\sin\omega}{\omega}\right)^2$$

find the PDF. Hint: Recall that when we convolve two functions together the Fourier transform of the new function is the product of the individual Fourier transforms. Also, see Table 11.1 for the characteristic function of a $\mathcal{U}(-1, 1)$ random variable.

11.42 (☺) (w) If $X^{(n)} \sim \mathcal{N}(\mu, 1/n)$, determine the PDF of the limiting random variable $X$ as $n \to \infty$. Use characteristic functions to do so.

11.43 (f) Find the mean and variance of a $\chi_N^2$ random variable using the characteristic function.



11.44 (☺) (f) The probability that a random variable deviates from its mean by an amount $\gamma$ in either direction is to be less than or equal to 1/2. What should $\gamma$ be?

11.45 (f) Determine the probability that $|X| > \gamma$ if $X \sim \mathcal{U}[-a, a]$. Next compare these results to the Chebyshev bound for $a = 2$.

11.46 (☺) (c) Estimate the mean and variance of a Rayleigh random variable with $\sigma^2 = 1$ using a computer simulation. Compare your estimated results to the theoretical values.

11.47 (c) Use the importance sampling method described in Section 11.10 to determine $Q(7)$. If you were to generate $M$ realizations of a $\mathcal{N}(0, 1)$ random variable and count the number that exceed $\gamma = 7$ as is usually done to estimate a right-tail probability, what would $M$ have to be (in terms of order of magnitude)?


Appendix  11A 


Partial  Proof  of  Expected  Value 
of  Function  of  Continuous 
Random  Variable 


For simplicity assume that $Y = g(X)$ is a continuous random variable with PDF $p_Y(y)$ (having no impulses). Also, assume that $y = g(x)$ is monotonically increasing so that it has a single solution to the equation $y = g(x)$ for all $y$ as shown in Figure 11A.1. Then

$$E[Y] = \int_{-\infty}^{\infty} y\,p_Y(y)\,dy = \int_{-\infty}^{\infty} y\,p_X(g^{-1}(y)) \left|\frac{dg^{-1}(y)}{dy}\right| dy \qquad \text{(from (10.30))}.$$

Figure 11A.1: Monotonically increasing function used to derive $E[g(X)]$.

Next change variables from $y$ to $x$ using $x = g^{-1}(y)$. Since we have assumed that $g(x)$ is monotonically increasing, the limits for $y$ of $\pm\infty$ also become $\pm\infty$ for $x$. Then, since $x = g^{-1}(y)$, we have that $y\,p_X(g^{-1}(y))$ becomes $g(x)p_X(x)$ and

$$\left|\frac{dg^{-1}(y)}{dy}\right| dy = \frac{dg^{-1}(y)}{dy}\,dy = \frac{dx}{dy}\,dy = dx$$

(the first equality holds because $g$ is monotonically increasing, which implies $g^{-1}$ is monotonically increasing, which implies the derivative is positive), from which (11.10) follows. The more general result for nonmonotonic functions follows along these lines.
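
The change of variables above can be checked numerically for a particular monotonically increasing function. The sketch below is an illustration under assumptions not made in the text: it takes $X \sim \mathcal{N}(0,1)$ and $g(x) = \exp(x)$, so that $g^{-1}(y) = \ln y$ and $p_Y(y)$ is nonzero only for $y > 0$ (the integration limits for $y$ are therefore $0$ to $\infty$ rather than $\pm\infty$). The names `pX`, `pY`, `EY_from_pY`, and `EY_from_pX` are introduced here for illustration only.

```matlab
% Compare the two ways of computing E[Y] for Y = g(X) = exp(X), X ~ N(0,1).
pX = @(x) exp(-0.5*x.^2)/sqrt(2*pi);       % N(0,1) PDF
g  = @(x) exp(x);                          % monotonically increasing function
% PDF of Y: p_X(g^{-1}(y)) * |d g^{-1}(y)/dy| with g^{-1}(y) = log(y), y > 0
pY = @(y) pX(log(y))./y;
EY_from_pY = integral(@(y) y.*pY(y), 0, Inf);        % integral of y p_Y(y) dy
EY_from_pX = integral(@(x) g(x).*pX(x), -Inf, Inf);  % integral of g(x) p_X(x) dx
fprintf('E[Y] via p_Y: %.6f, via p_X: %.6f\n', EY_from_pY, EY_from_pX);
```

Both integrals should agree to within the quadrature tolerance, and for this choice of $g$ the exact value is $\exp(1/2)$.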


Chapter  12 


Multiple  Continuous  Random 
Variables 

12.1  Introduction 

In  Chapter  7  we  discussed  multiple  discrete  random  variables.  We  now  proceed  to 
parallel  that  discussion  for  multiple  continuous  random  variables.  We  will  consider 
in  this  chapter  only  the  case  of  two  random  variables,  also  called  bivariate  random 
variables ,  with  the  extension  to  any  number  of  continuous  random  variables  to  be 
presented  in  Chapter  14.  In  describing  bivariate  discrete  random  variables,  we  used 
the  example  of  height  and  weight  of  a  college  student.  Figure  7.1  displayed  the 
probabilities  of  a  student  having  a  height  in  a  given  interval  and  a  weight  in  a  given 
interval. For example, the probability of having a height in the interval [5' 8", 6'] and a weight in the interval [160, 190] lbs. is 0.14 as listed in Table 4.1 and as seen in Figure 7.1 for the values of $H = 70$ inches and $W = 175$ lbs. For physical
measurements  such  as  height  and  weight,  however,  we  would  expect  to  observe  a 
continuum  of  values.  As  such,  height  and  weight  are  more  appropriately  modeled 
by  multiple  continuous  random  variables.  For  example,  we  might  have  a  population 
of college students, all of whose heights and weights lie in the intervals $60 \le H \le 80$ inches and $100 \le W \le 250$ lbs. Therefore, the continuous random variables $(H, W)$ would take on values in the sample space

$$\mathcal{S}_{H,W} = \{(h, w) : 60 \le h \le 80,\ 100 \le w \le 250\}$$

which is a subset of the plane, i.e., $R^2$. We might wish to determine probabilities such as $P[61 \le H \le 67.5,\ 98.5 \le W \le 154]$, which cannot be found from Figure 7.1.
In  order  to  compute  such  a  probability  we  will  define  a  joint  PDF  for  the  continuous 
random  variables  H  and  W.  It  will  be  a  two-dimensional  function  of  h  and  w.  In  the 
case  of  a  single  random  variable  we  needed  to  integrate  to  find  the  area  under  the 
PDF  as  the  desired  probability.  Now  integration  of  the  joint  PDF,  which  is  a  function 
of  two  variables,  will  produce  the  probability.  However,  we  will  now  be  determining 
the volume under the joint PDF. All our concepts for a single continuous random
variable  will  extend  to  the  case  of  two  random  variables.  Computationally,  however, 
we  will  encounter  more  difficulty  since  two-dimensional  integrals,  also  known  as 
double  integrals,  will  need  to  be  evaluated.  Hence,  the  reader  should  be  acquainted 
with  double  integrals  and  their  evaluation  using  iterated  integrals. 


12.2  Summary 


The concept of jointly distributed continuous random variables is introduced in Section 12.3. Given the joint PDF the probability of any event defined on the plane
is  given  by  (12.2).  The  standard  bivariate  Gaussian  PDF  is  given  by  (12.3)  and  is 
plotted  in  Figure  12.9.  The  concept  of  constant  PDF  contours  is  also  illustrated 
in  Figure  12.9.  The  marginal  PDF  is  found  from  the  joint  PDF  using  (12.4).  The 
joint  CDF  is  defined  by  (12.6)  and  is  evaluated  using  (12.7).  Its  properties  are 
listed  in  P12.1-P12.6.  To  obtain  the  joint  PDF  from  the  joint  CDF  we  use  (12.9). 
Independence  of  jointly  distributed  random  variables  is  defined  by  (12.10)  and  can 
be  verified  by  the  factorization  of  either  the  PDF  as  in  (12.11)  or  the  CDF  as  in 
(12.12).  Section  12.6  addresses  the  problem  of  determining  the  PDF  of  a  function 
of  two  random  variables — see  (12.13),  and  that  of  determining  the  joint  PDF  of 
a  function  which  maps  two  random  variables  into  two  new  random  variables.  See 
(12.18)  for  a  linear  transformation  and  (12.22)  for  a  nonlinear  transformation.  The 
general  bivariate  Gaussian  PDF  is  defined  in  (12.24)  and  some  useful  properties 
are  discussed  in  Section  12.7.  In  particular,  Theorem  12.7.1  indicates  that  a  linear 
transformation  of  a  bivariate  Gaussian  random  vector  produces  another  bivariate 
Gaussian random vector, although with different means and covariances. Example 12.14 indicates how a bivariate Gaussian random vector may be transformed to
one  with  independent  components.  Also,  a  formula  for  computation  of  the  expected 
value  of  a  function  of  two  random  variables  is  given  as  (12.28).  Section  12.9  discusses 
prediction  of  a  random  variable  from  the  observation  of  a  second  random  variable 
while  Section  12.10  summarizes  the  joint  characteristic  function  and  its  properties. 
In  particular,  the  use  of  (12.47)  allows  the  determination  of  the  PDF  of  the  sum 
of  two  continuous  and  independent  random  variables.  It  is  used  to  prove  that  two 
independent  Gaussian  random  variables  that  are  added  together  produce  another 
Gaussian  random  variable  in  Example  12.15.  Section  12.11  shows  how  to  simulate 
on a computer a random vector with any desired mean vector and covariance matrix by using the Cholesky decomposition of the covariance matrix; see (12.53).
If  the  desired  random  vector  is  bivariate  Gaussian,  then  the  procedure  provides  a 
general  method  for  generating  Gaussian  random  vectors  on  a  computer.  Finally,  an 
application  to  optical  character  recognition  is  described  in  Section  12.12. 
12.3 Jointly Distributed Random Variables

We  consider  two  continuous  random  variables  that  will  be  denoted  by  X  and  Y.  As 
alluded  to  in  the  introduction,  they  represent  the  functions  that  map  an  outcome  s 
of  an  experiment  to  a  point  in  the  plane.  Hence,  we  have  that 

$$\begin{bmatrix} X(s) \\ Y(s) \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix}$$

for all $s \in \mathcal{S}$. An example is shown in Figure 12.1 in which the outcome of a dart toss $s$, which is a point within a unit radius circular dartboard, is mapped into a point in the plane, which is within the unit circle. The random variables $X$ and $Y$ are said to be jointly distributed continuous random variables. As before, we will denote the random variables as $(X, Y)$ or $[X\ Y]^T$, in either case referring to them as a random vector. Note that a different mapping would result if we chose to represent the point in $\mathcal{S}_{X,Y}$ in polar coordinates $(r, \theta)$. Then we would have

$$\mathcal{S}_{R,\Theta} = \{(r, \theta) : 0 \le r \le 1,\ 0 \le \theta < 2\pi\}.$$

Figure 12.1: Mapping of the outcome of a thrown dart to the plane (example of jointly continuous random variables).

This  is  a  different  random  vector  but  is  of  course  related  to  (X,Y).  Depending 
upon  the  shape  of  the  mapped  region  in  the  plane,  it  may  be  more  convenient  to 
use  either  rectangular  coordinates  or  polar  coordinates  for  probability  calculations 
(see  also  Problem  12.1). 

Typical outcomes of the random variables are shown in Figure 12.2 as points in $\mathcal{S}_{X,Y}$ for two different players. In Figure 12.2a 100 outcomes for a novice dart player are shown while those for a champion dart player are displayed in Figure 12.2b. We might be interested in the probability that $\sqrt{X^2 + Y^2} \le 1/4$, which is the event that a bullseye is attained. Now our event of interest is a two-dimensional region as opposed to a one-dimensional interval for a single continuous random variable.

Figure 12.2: Typical outcomes for novice and champion dart player. (a) Novice. (b) Champion.

In the case of the novice dart player the dart is equally likely to land anywhere in the unit circle and hence the probability is


$$P[\text{bullseye}] = \frac{\text{Area of bullseye}}{\text{Total area of dartboard}} = \frac{\pi(1/4)^2}{\pi(1)^2} = \frac{1}{16}.$$

However,  for  a  champion  dart  player  we  see  from  Figure  12.2b  that  the  probability  of 
a  bullseye  is  much  higher.  How  should  we  compute  this  probability?  For  the  novice 
dart  player  we  can  interpret  the  probability  calculation  geometrically  as  shown  in 
Figure  12.3  as  the  volume  of  the  inner  cylinder  since 


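$$P[\text{bullseye}] = \pi(1/4)^2 \times \frac{1}{\pi} = \underbrace{\text{Area of bullseye}}_{\text{Area of event}} \times \underbrace{\frac{1}{\pi}}_{\text{Height}}.$$

If we define a function

$$p_{X,Y}(x, y) = \begin{cases} \dfrac{1}{\pi} & x^2 + y^2 \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad (12.1)$$

then this volume is also given by

$$P[A] = \iint_A p_{X,Y}(x, y)\,dx\,dy \qquad (12.2)$$

where $A = \{(x, y) : x^2 + y^2 \le (1/4)^2\}$ is the bullseye region.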


Figure  12.3:  Geometric  interpretation  of  bullseye  probability  calculation  for  novice 
dart  thrower. 


In analogy with the definition of the PDF for a single random variable $X$, we define $p_{X,Y}(x, y)$ as the joint PDF of $X$ and $Y$. For this example, it is given by (12.1) and is used to evaluate the probability that $(X, Y)$ lies in a given region $A$ by (12.2). The region $A$ can be any subset of the plane. Note that in using (12.2) we are determining the volume under $p_{X,Y}$, hence the need for a double integral. Another example follows.

Example  12.1  —  Pyramid-like  joint  PDF 

A  joint  PDF  is  given  by 


$$p_{X,Y}(x, y) = \begin{cases} 4(1 - |2x - 1|)(1 - |2y - 1|) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$


We wish to first verify that the PDF integrates to one. Then, we consider the evaluation of $P[1/4 \le X \le 3/4,\ 1/4 \le Y \le 3/4]$. A three-dimensional plot of the PDF is shown in Figure 12.4 and appears pyramid-like. Since it is often difficult to visualize the PDF in 3-D, it is helpful to plot the contours of the PDF as shown in Figure 12.5. As seen in the contour plot (also called a topographical map) the innermost contour consists of all values of $(x, y)$ for which $p_{X,Y}(x, y) = 3.5$. This contour is obtained by slicing the solid shown in Figure 12.4 with a plane parallel to the x-y plane and at a height of 3.5 and similarly for the other contours. These contours are called contours of constant PDF.

Figure 12.4: Three-dimensional plot of joint PDF.

Figure 12.5: Contour plot of joint PDF.

To verify that $p_{X,Y}$ is indeed a valid joint PDF, we need to show that the volume under the PDF is equal to one. Since the sample space is $\mathcal{S}_{X,Y} = \{(x, y) : 0 \le x \le 1,\ 0 \le y \le 1\}$ we have that

$$P[\mathcal{S}_{X,Y}] = \int_0^1 \int_0^1 4(1 - |2x - 1|)(1 - |2y - 1|)\,dx\,dy = \int_0^1 2(1 - |2x - 1|)\,dx \int_0^1 2(1 - |2y - 1|)\,dy.$$

The two definite integrals are seen to be identical and hence we need only evaluate one of these. But each integral is the area under the function shown in Figure 12.6a, which is easily found to be 1. Hence, $P[\mathcal{S}_{X,Y}] = 1 \cdot 1 = 1$, verifying that $p_{X,Y}$ is a valid PDF.

Figure 12.6: Plot of function $g(x) = 2(1 - |2x - 1|)$.

Next to find $P[1/4 \le X \le 3/4,\ 1/4 \le Y \le 3/4]$ we use (12.2) to yield

$$P[A] = \int_{1/4}^{3/4} \int_{1/4}^{3/4} 4(1 - |2x - 1|)(1 - |2y - 1|)\,dx\,dy.$$

By the same argument as before we have

$$P[A] = \left[\int_{1/4}^{3/4} 2(1 - |2x - 1|)\,dx\right]^2$$

and referring to Figure 12.6b, we have that each unshaded triangle has an area of $(1/2)(1/4)(1) = 1/8$ and so

$$P[A] = \left(1 - \frac{1}{8} - \frac{1}{8}\right)^2 = \frac{9}{16}.$$

In summary, a joint PDF has the expected properties of being a nonnegative two-dimensional function that integrates to one over $R^2$.
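
The two results of Example 12.1 can be confirmed by numerical double integration. The sketch below is only an illustration: it uses MATLAB's `integral2`, which is not referred to in the text, and the names `pXY`, `total`, and `PA` are introduced here for convenience.

```matlab
% Numerical check of Example 12.1 using two-dimensional quadrature.
pXY = @(x,y) 4*(1 - abs(2*x - 1)).*(1 - abs(2*y - 1));   % joint PDF on the unit square
total = integral2(pXY, 0, 1, 0, 1);          % volume under the PDF over the sample space
PA = integral2(pXY, 1/4, 3/4, 1/4, 3/4);     % P[1/4 <= X <= 3/4, 1/4 <= Y <= 3/4]
fprintf('Total volume = %.6f, P[A] = %.6f (9/16 = %.6f)\n', total, PA, 9/16);
```

Because the integrand is separable and the region is rectangular, the same numbers could also be obtained as the square of a one-dimensional integral, which is the shortcut taken in the example.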


For  the  previous  example  the  double  integral  was  easily  evaluated  since 

1.  The  integrand  was  separable  (we  will  see  shortly  that  this  property 

will  hold  when  the  random  variables  are  independent). 

2.  The  integration  region  in  the  x-y  plane  was  rectangular. 

More generally this will not be the case. Consider, for example, the computation of $P[Y \le X]$. We need to integrate $p_{X,Y}$ over the shaded region shown in Figure 12.7. To do so we first integrate in the $y$ direction for a fixed $x$, shown as the darkly shaded region. Since $0 \le y \le x$ for a fixed $x$, we have the limits of 0 to $x$ for the integration over $y$ and the limits of 0 to 1 for the final integration over $x$. This results in

$$P[Y \le X] = \int_0^1 \int_0^x p_{X,Y}(x, y)\,dy\,dx = \int_0^1 \int_0^x 4(1 - |2x - 1|)(1 - |2y - 1|)\,dy\,dx.$$

Figure 12.7: Integration region to determine $P[Y \le X]$.
Although  the  integration  can  be  carried  out,  it  is  tedious.  In  this  illustration  the 
joint  PDF  is  separable  but  the  integration  region  is  not  rectangular. 
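
Although tedious by hand, the iterated integral for $P[Y \le X]$ is straightforward to evaluate numerically. The following sketch is an illustration under assumptions: it relies on MATLAB's `integral2` accepting a function handle for the upper limit of the inner integration, and the names `pXY` and `PYleX` are introduced here. Since the joint PDF is unchanged when $x$ and $y$ are interchanged, the result should be 1/2.

```matlab
% P[Y <= X] for the pyramid-like joint PDF over the triangular region 0 <= y <= x <= 1.
pXY = @(x,y) 4*(1 - abs(2*x - 1)).*(1 - abs(2*y - 1));   % joint PDF on the unit square
% Outer integration over x from 0 to 1, inner integration over y from 0 to x.
PYleX = integral2(pXY, 0, 1, 0, @(x) x);
fprintf('P[Y <= X] = %.6f\n', PYleX);
```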


Zero  probability  events  are  more  complex  in  two  dimensions. 


Recall  that  for  a  single  continuous  random  variable  the  probability  of  X  attaining 
any  value  is  zero.  This  is  because  the  area  under  the  PDF  is  zero  for  any  zero  length 
interval. Similarly, for jointly continuous random variables X and Y the probability
of  any  event  defined  on  the  x-y  plane  will  be  zero  if  the  region  of  the  event  in  the 
plane  has  zero  area.  Then,  the  volume  under  the  joint  PDF  will  be  zero.  Some 
examples  of  these  zero  probability  events  are  shown  in  Figure  12.8. 




Figure  12.8:  Examples  of  zero  probability  events  for  jointly  distributed  continuous 
random  variables  X  and  Y.  All  regions  in  the  x-y  plane  have  zero  area. 



An important joint PDF is the standard bivariate Gaussian or normal PDF, which is defined as

$$p_{X,Y}(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left[-\frac{1}{2(1 - \rho^2)}\left(x^2 - 2\rho xy + y^2\right)\right] \qquad \begin{array}{l} -\infty < x < \infty \\ -\infty < y < \infty \end{array} \qquad (12.3)$$

where $\rho$ is a parameter that takes on values $-1 < \rho < 1$. (The use of the term standard is because as is shown later the means of $X$ and $Y$ are 0 and the variances are 1.) The joint PDF is shown in Figure 12.9 for various values of $\rho$. We will see shortly that $\rho$ is actually the correlation coefficient $\rho_{X,Y}$ first introduced in Section 7.9. The contours of constant PDF shown in Figures 12.9b,d,f are given by the values of $(x, y)$ for which

$$x^2 - 2\rho xy + y^2 = r^2$$

where $r$ is a constant. This is because for these values of $(x, y)$ the joint PDF takes on the fixed value

$$p_{X,Y}(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left[-\frac{r^2}{2(1 - \rho^2)}\right].$$

If $\rho = 0$, these contours are circular as seen in Figure 12.9d and otherwise they are elliptical. Note that our use of $r^2$, which implies that $x^2 - 2\rho xy + y^2 \ge 0$, is valid since in vector/matrix notation

$$x^2 - 2\rho xy + y^2 = \begin{bmatrix} x & y \end{bmatrix} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$

which is a quadratic form. Because $-1 < \rho < 1$, the matrix is positive definite (its principal minors are all positive; see Appendix C) and hence the quadratic form is positive. We will frequently use the standard bivariate Gaussian PDF and
its  generalizations  as  examples  to  illustrate  other  concepts.  This  is  because  its 
mathematical  tractability  lends  itself  to  easy  algebraic  manipulations. 
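
The positive definiteness argument can be illustrated numerically for the matrix $\begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}$ associated with the quadratic form $x^2 - 2\rho xy + y^2$. The sketch below is a check for one assumed value, $\rho = 0.9$, not a proof; the names `C`, `lambda`, `v`, `q`, and `qMat` are introduced here for illustration.

```matlab
% Check that x^2 - 2*rho*x*y + y^2 is a positive quadratic form for |rho| < 1.
rho = 0.9;                                % illustrative value with -1 < rho < 1
C = [1 -rho; -rho 1];                     % matrix of the quadratic form
lambda = eig(C);                          % eigenvalues are 1 - rho and 1 + rho
rand('state',0);                          % legacy seeding call, as used elsewhere in the text
v = 10*rand(1000,2) - 5;                  % random test points (x,y) in a square
q = v(:,1).^2 - 2*rho*v(:,1).*v(:,2) + v(:,2).^2;   % quadratic form written out
qMat = sum((v*C).*v, 2);                  % same form as [x y]*C*[x y]' for each row
fprintf('eigenvalues: %.2f, %.2f; min q = %.4f\n', lambda(1), lambda(2), min(q));
```

Both eigenvalues are positive for any $-1 < \rho < 1$, which is equivalent to the principal-minor test cited in the text.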

12.4  Marginal  PDFs  and  the  Joint  CDF 

The  marginal  PDF  px{%)  of  jointly  distributed  continuous  random  variables  X  and 
Y  is  the  usual  PDF  which  yields  the  probability  of  a  <  X  <  b  when  integrated  over 
the  interval  [a,  6].  To  determine  px(%)  if  we  are  given  the  joint  PDF  px,y(xi  !/)>  we 
consider  the  event  * 


A  —  {(#, y)  :  a  <  x  <  6,  — oo  <  y  <  00} 


whose  probability  must  be  the  same  as 


Ax  =  {x  :  a  <  x  <  b}. 


Thus,  using  (12.2) 


P[a  <  X  <  b\  = 


P[AX]  =  P[A] 

JJ  px,r(x,y)dxdy 


A 

*oo  rb 


no 

Px,r{x,y)dxdy 


rb  r  00 

/  /  px,v{x,y)dydx. 

J a  J  —00 


Px(x) 


Clearly  then,  we  must  have  that 


/oo 

Px,Y(x,y)dy 

-OO 


(12.4) 


as the marginal PDF for $X$. This operation is shown in Figure 12.10. In effect, we
"sum" the probabilities of all the $y$ values associated with the desired $x$, much the
same as summing along a row to determine the marginal PMF $p_X[x_i]$ from the joint
PMF $p_{X,Y}[x_i, y_j]$. The marginal PDF can also be viewed as the limit as $\Delta x \to 0$ of

$$p_X(x_0) = \frac{P[x_0 - \Delta x/2 \leq X \leq x_0 + \Delta x/2,\ -\infty < Y < \infty]}{\Delta x} = \frac{\int_{x_0 - \Delta x/2}^{x_0 + \Delta x/2} \int_{-\infty}^{\infty} p_{X,Y}(x,y)\,dy\,dx}{\Delta x}$$

for a small $\Delta x$. An example follows.

Figure 12.10: Obtaining the marginal PDF of $X$ from the joint PDF of $(X,Y)$. (a) Curve is $p_{X,Y}(-1,y)$. (b) Area under curve is $p_X(-1)$.

Example  12.2  -  Marginal  PDFs  for  Standard  Bivariate  Gaussian  PDF 

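From (12.3) and (12.4) we have that

$$p_X(x) = \int_{-\infty}^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(x^2 - 2\rho xy + y^2\right)\right] dy. \tag{12.5}$$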


To carry out the integration we convert the integrand to one we recognize, i.e.,
a Gaussian, for which the integral over $(-\infty, \infty)$ is known. The trick here is to
"complete the square" in $y$ as follows:

$$\begin{aligned}
Q &= y^2 - 2\rho xy + x^2 \\
&= y^2 - 2\rho xy + \rho^2 x^2 + x^2 - \rho^2 x^2 \\
&= (y - \rho x)^2 + (1 - \rho^2)x^2.
\end{aligned}$$


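Substituting into (12.5) produces

$$\begin{aligned}
p_X(x) &= \exp\left(-\tfrac{1}{2}x^2\right) \int_{-\infty}^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}(y - \rho x)^2\right] dy \\
&= \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}x^2\right) \underbrace{\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(y - \mu)^2\right] dy}_{=1}
\end{aligned}$$

where $\mu = \rho x$ and $\sigma^2 = 1 - \rho^2$, so that we have

$$p_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}x^2\right)$$

or $X \sim \mathcal{N}(0,1)$. Hence, the marginal PDF for $X$ is a standard Gaussian PDF.
By reversing the roles of $X$ and $Y$, we will also find that $Y \sim \mathcal{N}(0,1)$. Note that
since the marginal PDFs are standard Gaussian PDFs, the corresponding bivariate
Gaussian PDF is also referred to as a standard one.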

❖ 

In the previous example we saw that the marginals could be found from the joint
PDF. However, in general the reverse process is not possible: given the marginal
PDFs we cannot determine the joint PDF. For example, knowing that $X \sim \mathcal{N}(0,1)$
and $Y \sim \mathcal{N}(0,1)$ does not allow us to determine $\rho$, which characterizes the joint
PDF. Furthermore, the marginal PDFs are the same for any $\rho$ in the interval $(-1,1)$.
This is just a restatement of the conclusion that we arrived at for joint and marginal
PMFs. In that case there were many possible two-dimensional sets of numbers, i.e.,
specified by a joint PMF, that could sum to the same one-dimensional set, i.e.,
specified by a marginal PMF.
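
The fact that the marginal does not depend on $\rho$ can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names, the evaluation point `x0 = 0.7`, and the integration grid are illustrative choices and do not come from the text. It approximates the integral of (12.4) by a Riemann sum for several values of $\rho$ and compares the result with the standard Gaussian PDF.

```python
import numpy as np

def standard_bivariate_gaussian_pdf(x, y, rho):
    """Standard bivariate Gaussian PDF with correlation coefficient rho."""
    norm = 1.0 / (2.0 * np.pi * np.sqrt(1.0 - rho**2))
    q = x**2 - 2.0 * rho * x * y + y**2
    return norm * np.exp(-q / (2.0 * (1.0 - rho**2)))

def marginal_px(x, rho, y_max=12.0, n=200001):
    """Approximate the marginal of X by integrating out y on a finite grid."""
    y = np.linspace(-y_max, y_max, n)
    dy = y[1] - y[0]
    return np.sum(standard_bivariate_gaussian_pdf(x, y, rho)) * dy

x0 = 0.7  # arbitrary evaluation point (assumption)
standard_normal_at_x0 = np.exp(-0.5 * x0**2) / np.sqrt(2.0 * np.pi)

# The same marginal value should appear for every rho in (-1, 1).
marginals = {rho: marginal_px(x0, rho) for rho in (-0.9, 0.0, 0.5, 0.9)}
for rho, value in marginals.items():
    print(f"rho = {rho:+.1f}: p_X({x0}) ~ {value:.6f}, "
          f"N(0,1) PDF = {standard_normal_at_x0:.6f}")
```

All four values of $\rho$ should reproduce the same number, which illustrates why the marginals alone cannot identify the joint PDF.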

We next define the joint CDF for continuous random variables $(X,Y)$. It is given
by

$$F_{X,Y}(x,y) = P[X \leq x, Y \leq y]. \tag{12.6}$$

From (12.2) it is evaluated using

$$F_{X,Y}(x,y) = \int_{-\infty}^{y} \int_{-\infty}^{x} p_{X,Y}(t,u)\,dt\,du. \tag{12.7}$$


Some  examples  follow. 

Example  12.3  -  Joint  CDF  for  an  exponential  joint  PDF 

If $(X,Y)$ have the joint PDF

$$p_{X,Y}(x,y) = \begin{cases} \exp[-(x+y)] & x \geq 0,\ y \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

then for $x \geq 0$, $y \geq 0$

$$\begin{aligned}
F_{X,Y}(x,y) &= \int_0^y \int_0^x \exp[-(t+u)]\,dt\,du \\
&= \int_0^y \exp(-u) \underbrace{\int_0^x \exp(-t)\,dt}_{1-\exp(-x)}\,du \\
&= \int_0^y [1 - \exp(-x)]\exp(-u)\,du \\
&= [1 - \exp(-x)] \int_0^y \exp(-u)\,du
\end{aligned}$$

so that

$$F_{X,Y}(x,y) = \begin{cases} [1 - \exp(-x)][1 - \exp(-y)] & x \geq 0,\ y \geq 0 \\ 0 & \text{otherwise.} \end{cases} \tag{12.8}$$

Figure 12.11: Joint exponential PDF and CDF. (a) PDF. (b) CDF.


The  joint  CDF  is  shown  in  Figure  12.11  along  with  the  joint  PDF.  Once  the  joint 
CDF  is  obtained  the  probability  for  any  rectangular  region  is  easily  found. 

❖


Example  12.4  -  Probability  from  CDF  for  exponential  random  variables 

Consider the rectangular region $A = \{(x,y) : 1 < x \leq 2,\ 2 < y \leq 3\}$. Then referring
to Figure 12.12 we determine the probability of $A$ by determining the probability of
the shaded region, then subtracting out the probability of each cross-hatched region
(one running from south-east to north-west and the other running from south-west
to north-east), and finally adding back in the probability of the double cross-hatched
region (which has been subtracted out twice).

Figure 12.12: Evaluation of probability of rectangular region $A$ using joint CDF.

This results in

$$\begin{aligned}
P[A] &= P[-\infty < X \leq 2,\ -\infty < Y \leq 3] - P[-\infty < X \leq 2,\ -\infty < Y \leq 2] \\
&\quad - P[-\infty < X \leq 1,\ -\infty < Y \leq 3] + P[-\infty < X \leq 1,\ -\infty < Y \leq 2] \\
&= F_{X,Y}(2,3) - F_{X,Y}(2,2) - F_{X,Y}(1,3) + F_{X,Y}(1,2).
\end{aligned}$$

For the joint CDF given by (12.8) this becomes

$$\begin{aligned}
P[A] &= [1 - \exp(-2)][1 - \exp(-3)] - [1 - \exp(-2)]^2 \\
&\quad - [1 - \exp(-1)][1 - \exp(-3)] + [1 - \exp(-1)][1 - \exp(-2)].
\end{aligned}$$

Upon simplification we have the result

$$P[A] = [\exp(-1) - \exp(-2)][\exp(-2) - \exp(-3)]$$

which can also be verified by a direct evaluation as

$$P[A] = \int_2^3 \int_1^2 \exp[-(x+y)]\,dx\,dy.$$

We see that the advantage here is that no integration is required. However, the
event $A$ must be a rectangular region.

❖ 
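
The inclusion-exclusion step of Example 12.4 is easy to reproduce in code. The sketch below uses only the Python standard library; the function names `joint_cdf` and `rectangle_probability` are hypothetical helpers introduced for illustration. It evaluates the four CDF terms of (12.8) and compares them with the simplified product form.

```python
import math

def joint_cdf(x, y):
    """Joint CDF (12.8) for the joint PDF exp[-(x + y)], x >= 0, y >= 0."""
    if x < 0 or y < 0:
        return 0.0
    return (1.0 - math.exp(-x)) * (1.0 - math.exp(-y))

def rectangle_probability(x1, x2, y1, y2):
    """P[x1 < X <= x2, y1 < Y <= y2] by inclusion-exclusion on the joint CDF."""
    return (joint_cdf(x2, y2) - joint_cdf(x2, y1)
            - joint_cdf(x1, y2) + joint_cdf(x1, y1))

# Region A of Example 12.4: 1 < x <= 2, 2 < y <= 3.
p_from_cdf = rectangle_probability(1.0, 2.0, 2.0, 3.0)

# Simplified closed form from the example.
p_closed_form = (math.exp(-1) - math.exp(-2)) * (math.exp(-2) - math.exp(-3))

print(p_from_cdf, p_closed_form)
```

The two printed numbers should agree to rounding error, and no integration is carried out at any point.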

The joint PDF can be recovered from the joint CDF by partial differentiation as

$$p_{X,Y}(x,y) = \frac{\partial^2 F_{X,Y}(x,y)}{\partial x\,\partial y} \tag{12.9}$$

which is the two-dimensional version of the fundamental theorem of calculus. As an
example we continue the previous one.


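Example 12.5 - Obtaining the joint PDF from the joint CDF for exponential random variables

Continuing with the previous example we have from (12.8) that

$$p_{X,Y}(x,y) = \begin{cases} \dfrac{\partial^2 [1 - \exp(-x)][1 - \exp(-y)]}{\partial x\,\partial y} & x \geq 0,\ y \geq 0 \\ 0 & \text{otherwise.} \end{cases}$$

For $x > 0$, $y > 0$

$$\begin{aligned}
p_{X,Y}(x,y) &= \frac{\partial}{\partial x} \frac{\partial [1 - \exp(-x)][1 - \exp(-y)]}{\partial y} \\
&= \frac{\partial [1 - \exp(-x)]}{\partial x} \frac{\partial [1 - \exp(-y)]}{\partial y} \\
&= \exp(-x)\exp(-y) = \exp[-(x+y)].
\end{aligned}$$

❖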
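
The relation (12.9) can also be checked numerically with a finite-difference approximation of the mixed partial derivative. This is only a sketch using the standard library; the helper `mixed_partial`, the step size `h`, and the test point are assumptions made for illustration.

```python
import math

def joint_cdf(x, y):
    """Joint CDF (12.8) of the exponential example."""
    if x < 0 or y < 0:
        return 0.0
    return (1.0 - math.exp(-x)) * (1.0 - math.exp(-y))

def mixed_partial(F, x, y, h=1e-3):
    """Central-difference approximation of d^2 F / (dx dy) at (x, y)."""
    return (F(x + h, y + h) - F(x + h, y - h)
            - F(x - h, y + h) + F(x - h, y - h)) / (4.0 * h * h)

x0, y0 = 0.8, 1.3  # arbitrary interior point (assumption)
pdf_from_cdf = mixed_partial(joint_cdf, x0, y0)
pdf_exact = math.exp(-(x0 + y0))

print(pdf_from_cdf, pdf_exact)
```

The approximation is only valid at points where the CDF is smooth, so the test point is kept away from the axes.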

Finally, the properties of the joint CDF are for the most part identical to those for
the CDF (see Section 7.4 for the properties of the joint CDF for discrete random
variables). They are (see Figure 12.11b for an illustration):

P12.1 $F_{X,Y}(-\infty, -\infty) = 0$

P12.2 $F_{X,Y}(+\infty, +\infty) = 1$

P12.3 $F_{X,Y}(x, \infty) = F_X(x)$

P12.4 $F_{X,Y}(\infty, y) = F_Y(y)$

P12.5 $F_{X,Y}(x,y)$ is monotonically increasing, which means that if $x_2 \geq x_1$ and
$y_2 \geq y_1$, then $F_{X,Y}(x_2, y_2) \geq F_{X,Y}(x_1, y_1)$.

P12.6 $F_{X,Y}(x,y)$ is continuous with no jumps (assuming that $X$ and $Y$ are jointly
continuous random variables). This property is different from the case of
jointly discrete random variables.


12.5  Independence  of  Multiple  Random  Variables 

The definition of independence of two continuous random variables is the same as for
discrete random variables. Two continuous random variables $X$ and $Y$ are defined
to be independent if for all events $A \subset R$ and $B \subset R$

$$P[X \in A, Y \in B] = P[X \in A]P[Y \in B]. \tag{12.10}$$

Using the definition of conditional probability this is equivalent to

$$P[Y \in B \mid X \in A] = \frac{P[X \in A, Y \in B]}{P[X \in A]} = P[Y \in B]$$

and similarly $P[X \in A \mid Y \in B] = P[X \in A]$. It can be shown that $X$ and $Y$ are
independent if and only if the joint PDF factors as (see Problem 12.20)

$$p_{X,Y}(x,y) = p_X(x)p_Y(y). \tag{12.11}$$

Alternatively, $X$ and $Y$ are independent if and only if (see Problem 12.21)

$$F_{X,Y}(x,y) = F_X(x)F_Y(y). \tag{12.12}$$


An  example  follows. 

Example  12.6  —  Independence  of  exponential  random  variables 

From Example 12.3 we have for the joint PDF

$$p_{X,Y}(x,y) = \begin{cases} \exp[-(x+y)] & x \geq 0,\ y \geq 0 \\ 0 & \text{otherwise.} \end{cases}$$

Recalling that the unit step function $u(x)$ is defined as $u(x) = 1$ for $x \geq 0$ and
$u(x) = 0$ for $x < 0$, we have

$$p_{X,Y}(x,y) = \exp[-(x+y)]u(x)u(y)$$

since $u(x)u(y) = 1$ if and only if $u(x) = 1$ and $u(y) = 1$, which will be true for
$x \geq 0$, $y \geq 0$. Hence, we have

$$p_{X,Y}(x,y) = \underbrace{\exp(-x)u(x)}_{p_X(x)}\ \underbrace{\exp(-y)u(y)}_{p_Y(y)}.$$

To assert independence we need only factor $p_{X,Y}(x,y)$ as $g(x)h(y)$, where $g$ and $h$
are nonnegative functions. However, to assert that $g(x)$ is actually $p_X(x)$ and $h(y)$
is actually $p_Y(y)$, each function, $g$ and $h$, must integrate to one. For example, we
could have factored $p_{X,Y}(x,y)$ into $(1/2)\exp(-x)u(x)$ and $2\exp(-y)u(y)$, but then
we could not claim that $p_X(x) = (1/2)\exp(-x)u(x)$ since it does not integrate to
one. Note also that the joint CDF given in Example 12.3 is also factorable as given
in (12.8) and in general, factorization of the CDF is also necessary and sufficient to
assert independence.

❖


Assessing independence - careful with domain of PDF

The joint PDF given by

$$p_{X,Y}(x,y) = \begin{cases} 2\exp[-(x+y)] & x \geq 0,\ y \geq 0, \text{ and } y < x \\ 0 & \text{otherwise} \end{cases}$$

is not factorable, although it is very similar to our previous example. The reason is
that the region in the $x$-$y$ plane where $p_{X,Y}(x,y) \neq 0$ cannot be written as $u(x)u(y)$
or for that matter as any $g(x)h(y)$. See Figure 12.13.

Figure 12.13: Nonfactorable region in $x$-$y$ plane.


A 
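
A short check makes the failure of factorization concrete. In the sketch below (standard library only) the two marginal PDFs are worked out by hand from (12.4) for this wedge-shaped joint PDF; they are not given in the text and should be read as an assumption of the example. At a point outside the wedge the joint PDF is zero while the product of the marginals is not, so (12.11) cannot hold.

```python
import math

def joint_pdf(x, y):
    """Joint PDF 2 exp[-(x + y)] on the wedge 0 <= y < x, zero elsewhere."""
    return 2.0 * math.exp(-(x + y)) if (x >= 0 and 0 <= y < x) else 0.0

def marginal_x(x):
    """Integral of the joint PDF over y from 0 to x (worked out by hand)."""
    return 2.0 * math.exp(-x) * (1.0 - math.exp(-x)) if x >= 0 else 0.0

def marginal_y(y):
    """Integral of the joint PDF over x from y to infinity (worked out by hand)."""
    return 2.0 * math.exp(-2.0 * y) if y >= 0 else 0.0

# A point with y > x lies outside the support of the joint PDF.
point = (1.0, 2.0)
joint_value = joint_pdf(*point)
product_value = marginal_x(point[0]) * marginal_y(point[1])

print(joint_value, product_value)
```

Because the two values differ, $X$ and $Y$ are not independent even though the exponential form of the PDF looks separable.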
Example 12.7 - Standard bivariate Gaussian PDF

From (12.3) we see that $p_{X,Y}(x,y)$ is only factorable if $\rho = 0$. From Figure 12.9d
this corresponds to the case of circular PDF contours. Specifically, for $\rho = 0$, we
have

$$\begin{aligned}
p_{X,Y}(x,y) &= \frac{1}{2\pi} \exp\left[-\tfrac{1}{2}(x^2 + y^2)\right] \qquad -\infty < x < \infty,\ -\infty < y < \infty \\
&= \underbrace{\frac{1}{\sqrt{2\pi}} \exp\left[-\tfrac{1}{2}x^2\right]}_{p_X(x)}\ \underbrace{\frac{1}{\sqrt{2\pi}} \exp\left[-\tfrac{1}{2}y^2\right]}_{p_Y(y)}.
\end{aligned}$$

Hence, we observe that if $\rho = 0$, then $X$ and $Y$ are independent. Furthermore, each
marginal PDF is a standard Gaussian (normal) PDF, but as shown in Example 12.2
this holds regardless of the value of $\rho$.

❖

Finally, note that if we can assume that $X$ and $Y$ are independent, then knowledge
of $p_X(x)$ and $p_Y(y)$ is sufficient to determine the joint PDF according to (12.11). In
practice, the independence assumption greatly simplifies the problem of joint PDF
estimation as we need only to estimate the two one-dimensional PDFs $p_X(x)$ and
$p_Y(y)$.


12.6  Transformations 

We will consider two types of transformations. The first one maps two continuous
random variables into a single continuous random variable as $Z = g(X,Y)$, and the
second one maps two continuous random variables into two new continuous random
variables as $W = g(X,Y)$ and $Z = h(X,Y)$. The first type of transformation
$Z = g(X,Y)$ is now discussed. The approach is to find the CDF of $Z$ and then
differentiate it to obtain the PDF. The CDF of $Z$ is given as

$$\begin{aligned}
F_Z(z) &= P[Z \leq z] && \text{(definition of CDF)} \\
&= P[g(X,Y) \leq z] && \text{(definition of } Z) \\
&= \iint_{\{(x,y):\, g(x,y) \leq z\}} p_{X,Y}(x,y)\,dx\,dy && \text{(from (12.2)).}
\end{aligned} \tag{12.13}$$

We see that it is necessary to integrate the joint PDF over the region in the plane
where $g(x,y) \leq z$. Depending upon the form of $g$, this may be a simple task or
unfortunately a very complicated one. A simple example follows. It is the continuous
version of (7.22), which yields the PMF for the sum of two independent discrete
random variables.
Example 12.8 - Sum of independent $\mathcal{U}(0,1)$ random variables

In Section 2.3 we inquired as to the distribution of the outcomes of an experiment
that added $U_1$, a number chosen at random from 0 to 1, to $U_2$, another number
chosen at random from 0 to 1. A histogram of the outcomes of a computer simulation
indicated that there is a higher probability of the sum being near 1, as opposed to
being near 0 or 2. We now know that $U_1 \sim \mathcal{U}(0,1)$, $U_2 \sim \mathcal{U}(0,1)$. Also, in the
experiment of Section 2.3 the two numbers were chosen independently of each other.
Hence, we can determine the probabilities of the sum random variable if we first find
the CDF of $X = U_1 + U_2$, where $U_1$ and $U_2$ are independent, and then differentiate
it to find the PDF of $X$. We will use (12.13) and replace $x$, $y$, $z$, and $g(x,y)$ by
$u_1$, $u_2$, $x$, and $g(u_1,u_2)$, respectively. Then

$$F_X(x) = \iint_{\{(u_1,u_2):\, u_1 + u_2 \leq x\}} p_{U_1,U_2}(u_1,u_2)\,du_1\,du_2.$$

To determine the possible values of $X$, we note that both $U_1$ and $U_2$ take on values in
$(0,1)$ and so $0 < X < 2$. In evaluating the CDF we need two different intervals for $x$
as shown in Figure 12.14. Since $U_1$ and $U_2$ are independent, we have $p_{U_1,U_2} = p_{U_1}p_{U_2}$
and therefore $p_{U_1,U_2}(u_1,u_2) = 1$ for $0 < u_1 < 1$ and $0 < u_2 < 1$, which results in

$$F_X(x) = \iint_{\{(u_1,u_2):\, u_1 + u_2 \leq x\}} 1\,du_1\,du_2 = \text{shaded area in Figure 12.14.}$$

Figure 12.14: Shaded areas are regions of integration used to find CDF. (a) $0 \leq x < 1$. (b) $1 \leq x \leq 2$.

Hence, the CDF is given by

$$F_X(x) = \begin{cases} 0 & x < 0 \\ \frac{1}{2}x^2 & 0 \leq x < 1 \\ 1 - \frac{1}{2}(2 - x)^2 & 1 \leq x \leq 2 \\ 1 & x > 2 \end{cases}$$

and the PDF is finally

$$p_X(x) = \frac{dF_X(x)}{dx} = \begin{cases} 0 & x < 0 \\ x & 0 \leq x < 1 \\ 2 - x & 1 \leq x \leq 2 \\ 0 & x > 2. \end{cases}$$

This PDF is shown in Figure 12.15. This is in agreement with our computer results
shown in Figure 2.2. The highest probability is at $x = 1$, which concurs with our
computer generated results of Section 2.3. Also, note that $p_X(x) = p_{U_1}(x) \star p_{U_2}(x)$,
where $\star$ denotes integral convolution (see Problem 12.28).

Figure 12.15: PDF for the sum of two independent $\mathcal{U}(0,1)$ random variables.

❖ 
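
The computer experiment referred to in Example 12.8 can be repeated in a few lines. This is a minimal Monte Carlo sketch, assuming NumPy; the seed, the number of trials, and the test points are arbitrary choices, and the book's own simulation of Section 2.3 is not reproduced here. It compares the empirical CDF of $U_1 + U_2$ with the piecewise CDF derived above.

```python
import numpy as np

def triangular_cdf(x):
    """CDF of X = U1 + U2 as derived in Example 12.8."""
    x = np.asarray(x, dtype=float)
    return np.where(x < 0, 0.0,
           np.where(x < 1, 0.5 * x**2,
           np.where(x <= 2, 1.0 - 0.5 * (2.0 - x)**2, 1.0)))

rng = np.random.default_rng(0)  # fixed seed for repeatability (assumption)
n_trials = 200_000

# Draw U1 and U2 independently and form their sum.
sums = rng.uniform(0.0, 1.0, n_trials) + rng.uniform(0.0, 1.0, n_trials)

test_points = np.array([0.25, 0.5, 1.0, 1.5, 1.75])
empirical_cdf = np.array([(sums <= x).mean() for x in test_points])
max_error = np.max(np.abs(empirical_cdf - triangular_cdf(test_points)))

print("empirical:  ", empirical_cdf)
print("theoretical:", triangular_cdf(test_points))
```

With this many trials the two rows should differ only in the third decimal place or so.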

More generally, we can derive a useful formula for the PDF of the sum of two
independent continuous random variables. According to (12.13), we first need to
determine the region in the plane for which $x + y \leq z$. This inequality can be written
as $y \leq z - x$, where $z$ is to be regarded for the present as a constant. To integrate
$p_{X,Y}(x,y)$ over this region, which is shown in Figure 12.16 as the shaded region, we
can use an iterated integral. Thus,

$$\begin{aligned}
F_Z(z) &= \int_{-\infty}^{\infty} \int_{-\infty}^{z-x} p_{X,Y}(x,y)\,dy\,dx \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{z-x} p_X(x)p_Y(y)\,dy\,dx && \text{(independence)} \\
&= \int_{-\infty}^{\infty} p_X(x) \int_{-\infty}^{z-x} p_Y(y)\,dy\,dx \\
&= \int_{-\infty}^{\infty} p_X(x) F_Y(z - x)\,dx && \text{(definition of CDF).}
\end{aligned}$$

Figure 12.16: Iterated integral evaluation - shaded region is $y \leq z - x$. Integrate
first in $y$ direction for a fixed $x$ and then integrate over $-\infty < x < \infty$.

If we now differentiate the CDF, we have

$$\begin{aligned}
p_Z(z) &= \frac{d}{dz} \int_{-\infty}^{\infty} p_X(x) F_Y(z - x)\,dx \\
&= \int_{-\infty}^{\infty} p_X(x) \frac{d}{dz} F_Y(z - x)\,dx && \text{(assume interchange is valid)} \\
&= \int_{-\infty}^{\infty} p_X(x) \left.\frac{d}{du} F_Y(u)\right|_{u = z - x} \frac{du}{dz}\,dx && \text{(chain rule with } u = z - x)
\end{aligned}$$

so that finally we have our formula

$$p_Z(z) = \int_{-\infty}^{\infty} p_X(x) p_Y(z - x)\,dx. \tag{12.14}$$

This is the analogous result to (7.22). It is recognized as a convolution integral,
which we can express more succinctly as $p_Z = p_X \star p_Y$, and thus may be more
easily evaluated by using characteristic functions. The latter approach is explored
in Section 12.10.
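
The convolution integral (12.14) can be approximated on a grid. The sketch below assumes NumPy and uses the two independent exponential marginals $\exp(-x)u(x)$ of Example 12.6 as inputs; the grid spacing and upper limit are arbitrary. The comparison curve $z\exp(-z)$ is the standard closed form for the sum of two independent unit-mean exponentials and is quoted here as a known result, not derived in the text.

```python
import numpy as np

dx = 0.005                      # grid spacing (assumption)
x = np.arange(0.0, 20.0, dx)    # grid on which both PDFs are sampled

p_x = np.exp(-x)                # p_X(x) = exp(-x) for x >= 0
p_y = np.exp(-x)                # p_Y(y) = exp(-y) for y >= 0

# Discrete approximation of the convolution integral (12.14):
# p_Z(z_k) ~ sum_i p_X(x_i) p_Y(z_k - x_i) dx
p_z = np.convolve(p_x, p_y)[:len(x)] * dx

z = x
exact = z * np.exp(-z)          # known PDF of the sum (quoted, not derived here)
max_error = np.max(np.abs(p_z - exact))

print("max abs error:", max_error)
print("area under p_Z:", np.sum(p_z) * dx)
```

The error is of the order of the grid spacing, which is the expected accuracy of a simple Riemann sum.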

A second approach to obtaining the PDF of $g(X,Y)$ is to let $W = X$, $Z =
g(X,Y)$, find the joint PDF of $W$ and $Z$, i.e., $p_{W,Z}(w,z)$, and finally integrate
out $W$ to yield the desired PDF for $Z$. This method was encountered previously
in Chapter 7, where it was used for discrete random variables, and was termed the
method of auxiliary random variables. To implement it now requires us to determine
the joint PDF of two new random variables that result from having transformed two
random variables. This is the second type of transformation we were interested in.
Hence, we now consider the more general transformation

$$\begin{aligned}
W &= g(X,Y) \\
Z &= h(X,Y).
\end{aligned}$$

The final result will be a formula relating the joint PDF of $(W,Z)$ to that of the
given joint PDF of $(X,Y)$. It will be a generalization of the single random variable
transformation formula

$$p_Y(y) = p_X(g^{-1}(y)) \left|\frac{dg^{-1}(y)}{dy}\right| \tag{12.15}$$

for $Y = g(X)$.

To understand what is involved, consider as an example the transformation

$$\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} U_1 \\ (U_1 + U_2)/2 \end{bmatrix} \tag{12.16}$$

where $U_1 \sim \mathcal{U}(0,1)$, $U_2 \sim \mathcal{U}(0,1)$, and $U_1$ and $U_2$ are independent. In Figure 2.13
we plotted realizations of $[U_1\ U_2]^T$ and $[X_1\ X_2]^T$. Note that the original joint PDF
$p_{U_1,U_2}$ is nonzero on the unit square while the transformed PDF is nonzero on a
parallelogram. In either case the PDFs appear to be uniformly distributed. Similar
observations about the region for which the PDF of the transformed random variable
is nonzero were made in the one-dimensional case for $Y = g(X)$, where $X \sim \mathcal{U}(0,1)$,
in Figure 10.22. In general, a linear transformation will change the support area of
the joint PDF, which is the region in the plane where the PDF is nonzero. In Figure
2.13 it is seen that the area of the square is 1 while that for the parallelogram is 1/2.
It can furthermore be shown that if we have the linear transformation (see Problem
12.29)

$$\begin{bmatrix} W \\ Z \end{bmatrix} = \underbrace{\begin{bmatrix} a & b \\ c & d \end{bmatrix}}_{G} \begin{bmatrix} X \\ Y \end{bmatrix} \tag{12.17}$$

then

$$\frac{\text{Area in } w\text{-}z \text{ plane}}{\text{Area in } x\text{-}y \text{ plane}} = |\det(G)| = |ad - bc|.$$

It is always assumed that $G$ is invertible so that $\det(G) \neq 0$. In the previous example
of (12.16) for which in our new notation we have $W = X$ and $Z = (X + Y)/2$, the
linear transformation matrix is

$$G = \begin{bmatrix} 1 & 0 \\ \frac{1}{2} & \frac{1}{2} \end{bmatrix}$$

and it is seen that $|\det(G)| = 1/2$. Thus, the PDF support region is decreased by
a factor of 2. We therefore expect the joint PDF of $[X\ (X + Y)/2]^T$ to be uniform
with a height of 2 (as opposed to a height of 1 for the original joint PDF). Hence,
the transformed PDF should have a factor of $1/|\det(G)|$ to make it integrate to one.
This amplification factor, which is $1/|\det(G)| = |\det(G^{-1})|$, must be included in the
expression for the transformed joint PDF. Also, we have that $[x\ y]^T = G^{-1}[w\ z]^T$.
Hence, it should not be surprising that for the linear transformation of (12.17) we
have the formula for the transformed joint PDF

$$p_{W,Z}(w,z) = p_{X,Y}\left(G^{-1}\begin{bmatrix} w \\ z \end{bmatrix}\right) |\det(G^{-1})|. \tag{12.18}$$
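
The predicted height of 2 for the transformed uniform PDF can be checked by simulation. The following is a rough sketch, assuming NumPy; the seed, sample size, and the small test box are illustrative choices. It applies $G$ of (12.16) to uniform samples on the unit square and estimates the density inside a box lying well within the parallelogram.

```python
import numpy as np

# Transformation matrix of (12.16): W = U1, Z = (U1 + U2)/2.
G = np.array([[1.0, 0.0],
              [0.5, 0.5]])

rng = np.random.default_rng(1)  # fixed seed (assumption)
n = 400_000
u = rng.uniform(0.0, 1.0, size=(2, n))  # columns are realizations of [U1 U2]^T
w, z = G @ u                             # transformed realizations

# Small box that lies entirely inside the parallelogram w/2 < z < (w + 1)/2.
w_lo, w_hi, z_lo, z_hi = 0.4, 0.6, 0.45, 0.55
in_box = (w >= w_lo) & (w < w_hi) & (z >= z_lo) & (z < z_hi)

# Estimated PDF height = fraction of points in the box / area of the box.
density_estimate = in_box.mean() / ((w_hi - w_lo) * (z_hi - z_lo))
predicted_height = 1.0 / abs(np.linalg.det(G))

print("estimated height:", density_estimate)
print("1/|det(G)|      :", predicted_height)
```

The estimate fluctuates from run to run but should stay close to $1/|\det(G)| = 2$, in line with the amplification factor in (12.18).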


An  example  follows. 

Example  12.9  —  Linear  transformation  for  standard  bivariate  Gaussian 
PDF 

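Assume that $(X,Y)$ has the PDF of (12.3) and consider the linear transformation

$$\begin{bmatrix} W \\ Z \end{bmatrix} = \underbrace{\begin{bmatrix} \sigma_W & 0 \\ 0 & \sigma_Z \end{bmatrix}}_{G} \begin{bmatrix} X \\ Y \end{bmatrix}.$$

Then,

$$G^{-1} = \begin{bmatrix} 1/\sigma_W & 0 \\ 0 & 1/\sigma_Z \end{bmatrix}$$

and

$$G^{-1}\begin{bmatrix} w \\ z \end{bmatrix} = \begin{bmatrix} w/\sigma_W \\ z/\sigma_Z \end{bmatrix}, \qquad \det(G^{-1}) = \frac{1}{\sigma_W \sigma_Z}$$

so that from (12.3) and (12.18)

$$\begin{aligned}
p_{W,Z}(w,z) &= \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left((w/\sigma_W)^2 - 2\rho wz/(\sigma_W\sigma_Z) + (z/\sigma_Z)^2\right)\right] \frac{1}{\sigma_W\sigma_Z} \\
&= \frac{1}{2\pi\sqrt{(1-\rho^2)\sigma_W^2\sigma_Z^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(\left(\frac{w}{\sigma_W}\right)^2 - 2\rho\left(\frac{w}{\sigma_W}\right)\left(\frac{z}{\sigma_Z}\right) + \left(\frac{z}{\sigma_Z}\right)^2\right)\right].
\end{aligned} \tag{12.19}$$

Note that since $-\infty < x < \infty$, $-\infty < y < \infty$, we have that the region of support
for $p_{W,Z}$ is $-\infty < w < \infty$, $-\infty < z < \infty$. Also, the joint PDF can be written in
vector/matrix form as (see Problem 12.31)

$$p_{W,Z}(w,z) = \frac{1}{2\pi\det^{1/2}(C)} \exp\left(-\frac{1}{2}\begin{bmatrix} w & z \end{bmatrix} C^{-1} \begin{bmatrix} w \\ z \end{bmatrix}\right) \tag{12.20}$$

where

$$C = \begin{bmatrix} \sigma_W^2 & \rho\sigma_W\sigma_Z \\ \rho\sigma_Z\sigma_W & \sigma_Z^2 \end{bmatrix}. \tag{12.21}$$

The matrix $C$ will be shown later to be the covariance matrix of $W$ and $Z$ (see Section 9.5
for the definition of the covariance matrix, which is also valid for continuous
random variables).

❖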
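
The equivalence of the scalar form (12.19) and the vector/matrix form (12.20) can be spot-checked numerically. This is a small sketch assuming NumPy; the function names and the parameter values at the bottom are arbitrary illustrations rather than values from the text.

```python
import numpy as np

def pdf_scalar_form(w, z, sigma_w, sigma_z, rho):
    """Joint PDF written out term by term as in (12.19)."""
    q = ((w / sigma_w)**2
         - 2.0 * rho * (w / sigma_w) * (z / sigma_z)
         + (z / sigma_z)**2)
    norm = 1.0 / (2.0 * np.pi * np.sqrt((1.0 - rho**2) * sigma_w**2 * sigma_z**2))
    return norm * np.exp(-q / (2.0 * (1.0 - rho**2)))

def pdf_matrix_form(w, z, sigma_w, sigma_z, rho):
    """Joint PDF in vector/matrix form as in (12.20) with C from (12.21)."""
    C = np.array([[sigma_w**2, rho * sigma_w * sigma_z],
                  [rho * sigma_z * sigma_w, sigma_z**2]])
    v = np.array([w, z])
    quad = v @ np.linalg.solve(C, v)   # v^T C^{-1} v without forming the inverse
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(C)))

# Arbitrary test values (assumptions).
args = dict(w=0.3, z=-1.2, sigma_w=1.7, sigma_z=0.6, rho=0.8)
value_scalar = pdf_scalar_form(**args)
value_matrix = pdf_matrix_form(**args)

print(value_scalar, value_matrix)
```

Agreement at a handful of points is not a proof, but it is a quick guard against sign or normalization slips when coding either form.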

For  nonlinear  transformations  a  result  similar  to  (12.18)  is  obtained.  This  is  because 
a  two-dimensional  nonlinear  function  can  be  linearized  about  a  point  by  replacing 
the  usual  tangent  or  derivative  approximation  for  a  one-dimensional  function  by  a 
tangent  plane  approximation  (see  Problem  12.32).  Hence,  if  the  transformation  is 
given  by 


W  =  g(X,Y) 
Z  =  h(X,Y) 


then a given point in the $w$-$z$ plane is obtained via $w = g(x,y)$, $z = h(x,y)$. Assume
that the latter set of equations has a single solution for all $(w,z)$, say

$$\begin{aligned}
x &= g^{-1}(w,z) \\
y &= h^{-1}(w,z).
\end{aligned}$$

Then it can be shown that

$$p_{W,Z}(w,z) = p_{X,Y}\left(g^{-1}(w,z), h^{-1}(w,z)\right) \left|\det\left(\frac{\partial(x,y)}{\partial(w,z)}\right)\right| \tag{12.22}$$

where

$$\frac{\partial(x,y)}{\partial(w,z)} = \begin{bmatrix} \dfrac{\partial x}{\partial w} & \dfrac{\partial x}{\partial z} \\[2mm] \dfrac{\partial y}{\partial w} & \dfrac{\partial y}{\partial z} \end{bmatrix} \tag{12.23}$$

is called the Jacobian matrix of the inverse transformation from $[w\ z]^T$ to $[x\ y]^T$ and
is sometimes referred to as $J^{-1}$. It represents the compensation for the
amplification/reduction of the areas due to the transformation. For a linear transformation
$G$ it is given by $J = G$ (see also (12.15) for a single random variable). We now
illustrate the use of this formula.


Example  12.10  -  Affine  transformation  for  standard  bivariate  Gaussian 
PDF 


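Let $(X,Y)$ have a standard bivariate Gaussian PDF and consider the affine
transformation

$$\begin{bmatrix} W \\ Z \end{bmatrix} = \begin{bmatrix} \sigma_W & 0 \\ 0 & \sigma_Z \end{bmatrix} \begin{bmatrix} X \\ Y \end{bmatrix} + \begin{bmatrix} \mu_W \\ \mu_Z \end{bmatrix}.$$

Then using (12.22) we first solve for $(x,y)$ as

$$\begin{aligned}
x &= \frac{w - \mu_W}{\sigma_W} \\
y &= \frac{z - \mu_Z}{\sigma_Z}.
\end{aligned}$$

The inverse Jacobian matrix becomes

$$\frac{\partial(x,y)}{\partial(w,z)} = \begin{bmatrix} 1/\sigma_W & 0 \\ 0 & 1/\sigma_Z \end{bmatrix}$$

and therefore, since

$$p_{X,Y}(x,y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(x^2 - 2\rho xy + y^2\right)\right]$$

we have from (12.22)

$$p_{W,Z}(w,z) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(\left(\frac{w-\mu_W}{\sigma_W}\right)^2 - 2\rho\left(\frac{w-\mu_W}{\sigma_W}\right)\left(\frac{z-\mu_Z}{\sigma_Z}\right) + \left(\frac{z-\mu_Z}{\sigma_Z}\right)^2\right)\right] \cdot \frac{1}{\sigma_W\sigma_Z}$$

or finally

$$p_{W,Z}(w,z) = \frac{1}{2\pi\sqrt{(1-\rho^2)\sigma_W^2\sigma_Z^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(\left(\frac{w-\mu_W}{\sigma_W}\right)^2 - 2\rho\left(\frac{w-\mu_W}{\sigma_W}\right)\left(\frac{z-\mu_Z}{\sigma_Z}\right) + \left(\frac{z-\mu_Z}{\sigma_Z}\right)^2\right)\right]. \tag{12.24}$$

This is called the bivariate Gaussian PDF. If $\mu_W = \mu_Z = 0$ and $\sigma_W = \sigma_Z = 1$, then
it reverts back to the usual standard bivariate Gaussian PDF. If $\mu_W = \mu_Z = 0$, we
have the joint PDF in Example 12.9. An example of the PDF is shown in Figure
12.17.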

❖ 

The bivariate Gaussian PDF can also be written more compactly in vector/matrix
form as

$$p_{W,Z}(w,z) = \frac{1}{2\pi\det^{1/2}(C)} \exp\left(-\frac{1}{2}\begin{bmatrix} w - \mu_W \\ z - \mu_Z \end{bmatrix}^T C^{-1} \begin{bmatrix} w - \mu_W \\ z - \mu_Z \end{bmatrix}\right) \tag{12.25}$$

where $C$ is the covariance matrix given by (12.21). It can also be shown that
the marginal PDFs are $W \sim \mathcal{N}(\mu_W, \sigma_W^2)$ and $Z \sim \mathcal{N}(\mu_Z, \sigma_Z^2)$ (see Problem 12.36).
Hence, the marginal PDFs of the bivariate Gaussian PDF are obtained by inspection
(see Problem 12.37).

Figure 12.17: Example of bivariate Gaussian PDF with $\mu_W = 1$, $\mu_Z = 1$, $\sigma_W^2 = 3$, $\sigma_Z^2 = 1$, and $\rho = 0.9$. (a) Joint PDF. (b) Contours of constant PDF.

Example 12.11 - Transformation of independent Gaussian random variables to a Cauchy random variable

Let $X \sim \mathcal{N}(0,1)$, $Y \sim \mathcal{N}(0,1)$, and $X$ and $Y$ be independent. Then consider the
transformation $W = X$, $Z = Y/X$. To determine $S_{W,Z}$ note that $w = x$ so that
$-\infty < w < \infty$ and since $z = y/x$ with $-\infty < x < \infty$, $-\infty < y < \infty$, we have
$-\infty < z < \infty$. Hence, $S_{W,Z}$ is the entire plane. To find the joint PDF we first solve
for $(x,y)$ as $x = w$ and $y = xz = wz$. The inverse Jacobian matrix is

$$\frac{\partial(x,y)}{\partial(w,z)} = \begin{bmatrix} 1 & 0 \\ z & w \end{bmatrix}$$

so that $|\det(\partial(x,y)/\partial(w,z))| = |w|$. Using (12.22), we have

$$\begin{aligned}
p_{W,Z}(w,z) &= \left.\frac{1}{2\pi} \exp\left[-\tfrac{1}{2}(x^2 + y^2)\right]\right|_{x=w,\,y=wz} |w| \\
&= \frac{1}{2\pi} \exp\left[-\tfrac{1}{2}(w^2 + w^2z^2)\right] |w| \\
&= \frac{1}{2\pi} \exp\left[-\tfrac{1}{2}(1 + z^2)w^2\right] |w|.
\end{aligned}$$

It is of interest to determine the marginal PDFs. Clearly, the marginal of $W = X$
is just the original PDF $\mathcal{N}(0,1)$. The marginal PDF for $Z$, which is the ratio of two
independent $\mathcal{N}(0,1)$ random variables, is found from (12.4) as

$$\begin{aligned}
p_Z(z) &= \int_{-\infty}^{\infty} p_{W,Z}(w,z)\,dw \\
&= \int_{-\infty}^{\infty} \frac{1}{2\pi} \exp\left[-\tfrac{1}{2}(1 + z^2)w^2\right] |w|\,dw \\
&= \frac{1}{\pi} \int_0^{\infty} w \exp\left[-\tfrac{1}{2}(1 + z^2)w^2\right] dw && \text{(integrand is even function)} \\
&= \frac{1}{\pi} \left.\frac{\exp\left[-(1/2)(1 + z^2)w^2\right]}{-(1 + z^2)}\right|_0^{\infty} \\
&= \frac{1}{\pi(1 + z^2)} \qquad -\infty < z < \infty
\end{aligned}$$

which is recognized as the Cauchy PDF. Hence, the PDF of $Y/X$, where $X$ and $Y$
are independent standard Gaussian random variables, is Cauchy. We have implicitly
used the method of auxiliary random variables to derive this result. Finally, the
observation that the denominator of $Y/X$ is a standard Gaussian random variable,
with significant probability of being near zero, may help to explain why the outcomes
of a Cauchy random variable are as large as they are. See Figure 11.3.

❖
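
The Cauchy result of Example 12.11 can be illustrated by simulation. This is a Monte Carlo sketch assuming NumPy; the seed, sample size, and test points are arbitrary. The comparison uses the Cauchy CDF $1/2 + \arctan(z)/\pi$, which follows from integrating $1/(\pi(1+z^2))$ and is stated here rather than in the text.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed (assumption)
n = 200_000

# Independent standard Gaussian numerator and denominator.
x = rng.standard_normal(n)
y = rng.standard_normal(n)
ratio = y / x

test_points = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
empirical_cdf = np.array([(ratio <= z0).mean() for z0 in test_points])

# CDF obtained by integrating the Cauchy PDF 1/(pi (1 + z^2)).
cauchy_cdf = 0.5 + np.arctan(test_points) / np.pi
max_error = np.max(np.abs(empirical_cdf - cauchy_cdf))

print("empirical:", empirical_cdf)
print("Cauchy   :", cauchy_cdf)
print("largest |Y/X| observed:", np.max(np.abs(ratio)))
```

The last printed line shows the very large outcomes produced when the denominator happens to fall near zero, which is the heavy-tail behavior mentioned at the end of the example.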

The  next  example  is  of  great  importance  in  many  fields  of  science  and  engineering. 

Example  12.12  -  Magnitude  and  angle  of  jointly  Gaussian  distributed 
random  variables 

Let $X \sim \mathcal{N}(0,\sigma^2)$, $Y \sim \mathcal{N}(0,\sigma^2)$, and $X$ and $Y$ be independent random variables.
Then, it is desired to find the joint PDF when $X$ and $Y$, considered as Cartesian
coordinates, are converted to polar coordinates via

$$\begin{aligned}
R &= \sqrt{X^2 + Y^2} && R \geq 0 \\
\Theta &= \arctan\frac{Y}{X} && 0 \leq \Theta < 2\pi.
\end{aligned} \tag{12.26}$$

It is common in many engineering disciplines, for example, in radar, sonar, and
communications, to transmit a sinusoidal signal and to process the received signal
by a digital computer. The received signal will be given by $s(t) = A\cos(2\pi F_0 t) +
B\sin(2\pi F_0 t)$ for a transmit frequency of $F_0$ Hz. However, because the received signal
is due to the sum of multiple reflections from an aircraft, as in the radar example,
the values of $A$ and $B$ are generally not known. Consequently, they are modeled
as continuous random variables with marginal PDFs $A \sim \mathcal{N}(0,\sigma^2)$, $B \sim \mathcal{N}(0,\sigma^2)$,
and where $A$ and $B$ are independent. Since the received signal can equivalently be
written in terms of a single sinusoid as (see Problem 12.42)

$$s(t) = \sqrt{A^2 + B^2}\cos(2\pi F_0 t - \arctan(B/A))$$

the amplitude is a random variable as is the phase angle. Thus, the transformation of
(12.26) is of interest in order to determine the joint PDF of the sinusoid's amplitude
and phase. This motivates our interest in this particular transformation.

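We first solve for $(x,y)$ as $x = r\cos\theta$, $y = r\sin\theta$. Then using (12.22) and
replacing $w$ by $r$ and $z$ by $\theta$ we have the inverse Jacobian

$$\frac{\partial(x,y)}{\partial(r,\theta)} = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}$$

and thus

$$\left|\det\left(\frac{\partial(x,y)}{\partial(r,\theta)}\right)\right| = r \geq 0.$$

Since

$$p_{X,Y}(x,y) = p_X(x)p_Y(y) = \frac{1}{2\pi\sigma^2} \exp\left[-\frac{1}{2\sigma^2}(x^2 + y^2)\right]$$

we have upon using (12.22)

$$\begin{aligned}
p_{R,\Theta}(r,\theta) &= \frac{r}{2\pi\sigma^2} \exp\left[-\frac{r^2}{2\sigma^2}\right] && r \geq 0,\ 0 \leq \theta < 2\pi \\
&= \underbrace{\frac{r}{\sigma^2} \exp\left[-\frac{r^2}{2\sigma^2}\right]}_{p_R(r)}\ \underbrace{\frac{1}{2\pi}}_{p_\Theta(\theta)} && r \geq 0,\ 0 \leq \theta < 2\pi.
\end{aligned}$$

Here we see that $R$ has a Rayleigh PDF with parameter $\sigma^2$, $\Theta$ has a uniform PDF,
and $R$ and $\Theta$ are independent random variables.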

❖ 
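
The conclusions of Example 12.12 can be probed with a quick simulation. The sketch below assumes NumPy; the value $\sigma = 2$, the seed, and the sample size are arbitrary, and the four-quadrant `arctan2` reduced modulo $2\pi$ is used as one way to realize the angle range $0 \leq \Theta < 2\pi$ of (12.26). The Rayleigh mean $\sigma\sqrt{\pi/2}$ is a standard result quoted for comparison.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed (assumption)
sigma = 2.0                     # common standard deviation (assumption)
n = 200_000

x = rng.normal(0.0, sigma, n)
y = rng.normal(0.0, sigma, n)

# Polar coordinates: magnitude and four-quadrant angle mapped to [0, 2*pi).
r = np.hypot(x, y)
theta = np.mod(np.arctan2(y, x), 2.0 * np.pi)

rayleigh_mean = sigma * np.sqrt(np.pi / 2.0)   # known mean of a Rayleigh PDF
mean_r_error = abs(r.mean() - rayleigh_mean)
mean_theta_error = abs(theta.mean() - np.pi)   # uniform on [0, 2*pi) has mean pi
corr_r_theta = np.corrcoef(r, theta)[0, 1]     # near zero if R, Theta independent

print("mean of R:", r.mean(), "Rayleigh mean:", rayleigh_mean)
print("mean of Theta:", theta.mean())
print("sample correlation of R and Theta:", corr_r_theta)
```

A near-zero sample correlation is consistent with independence of $R$ and $\Theta$ but does not by itself prove it; the factorization of the joint PDF in the example is what establishes independence.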


12.7  Expected  Values 

The expected value of two jointly distributed continuous random variables $X$ and $Y$,
or equivalently the random vector $[X\ Y]^T$, is defined as the vector of the expected
values. That is to say

$$E_{X,Y}\left[\begin{bmatrix} X \\ Y \end{bmatrix}\right] = \begin{bmatrix} E_X[X] \\ E_Y[Y] \end{bmatrix}.$$

Of course this is equivalent to the vector of the expected values of the marginal
PDFs. As an example, for the bivariate Gaussian PDF as given by (12.24) with
$W$, $Z$ replaced by $X$, $Y$, the marginals are $\mathcal{N}(\mu_X, \sigma_X^2)$ and $\mathcal{N}(\mu_Y, \sigma_Y^2)$ and hence the
expected value or equivalently the mean of the random vector is

$$E_{X,Y}\left[\begin{bmatrix} X \\ Y \end{bmatrix}\right] = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}$$

as shown in Figure 12.17 for $\mu_X = \mu_Y = 1$.

We frequently require the expected value of a function of two jointly distributed
random variables or of $Z = g(X,Y)$. By definition this is

$$E[Z] = \int_{-\infty}^{\infty} z\,p_Z(z)\,dz.$$

But as in the case for jointly distributed discrete random variables we can avoid the
determination of $p_Z(z)$ by employing instead the formula

$$E[Z] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y)p_{X,Y}(x,y)\,dx\,dy. \tag{12.27}$$

To remind us that the averaging PDF is $p_{X,Y}(x,y)$ we usually write this as

$$E_{X,Y}[g(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y)p_{X,Y}(x,y)\,dx\,dy. \tag{12.28}$$

If the function $g$ depends on only one of the variables, say $X$, then we have

$$\begin{aligned}
E_{X,Y}[g(X)] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x)p_{X,Y}(x,y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} g(x) \underbrace{\int_{-\infty}^{\infty} p_{X,Y}(x,y)\,dy}_{p_X(x)}\,dx \\
&= E_X[g(X)].
\end{aligned}$$


As in the case of discrete random variables (see Section 7.7), the expectation has
the following properties:

1. Linearity

$$E_{X,Y}[aX + bY] = aE_X[X] + bE_Y[Y]$$

and more generally

$$E_{X,Y}[ag(X,Y) + bh(X,Y)] = aE_{X,Y}[g(X,Y)] + bE_{X,Y}[h(X,Y)].$$

2. Factorization for independent random variables
If $X$ and $Y$ are independent random variables

$$E_{X,Y}[XY] = E_X[X]E_Y[Y] \tag{12.29}$$

and more generally

$$E_{X,Y}[g(X)h(Y)] = E_X[g(X)]E_Y[h(Y)]. \tag{12.30}$$

Also, in determining the variance for a sum of random variables we have

$$\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y) + 2\,\mathrm{cov}(X,Y) \tag{12.31}$$

where $\mathrm{cov}(X,Y) = E_{X,Y}[(X - E_X[X])(Y - E_Y[Y])]$. If $X$ and $Y$ are independent,
then by (12.30)

$$\begin{aligned}
\mathrm{cov}(X,Y) &= E_{X,Y}[(X - E_X[X])(Y - E_Y[Y])] \\
&= E_X[(X - E_X[X])]E_Y[(Y - E_Y[Y])] \\
&= 0.
\end{aligned}$$

The covariance can also be computed as

$$\mathrm{cov}(X,Y) = E_{X,Y}[XY] - E_X[X]E_Y[Y] \tag{12.32}$$

where

$$E_{X,Y}[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\,p_{X,Y}(x,y)\,dx\,dy. \tag{12.33}$$

An example follows.

Example  12.13  —  Covariance  for  standard  bivariate  Gaussian  PDF 

For  the  standard  bivariate  Gaussian  PDF  of  (12.3)  we  now  determine  cov(X,  Y). 
We  have  already  seen  that  the  marginal  PDFs  are  X  ~  A/"(0, 1)  and  Y  ^  A/"(0, 1)  so 
that  Ex[X]  —  Ey[Y]  —  0.  From  (12.32)  we  have  that  cov(A,  Y)  —  Ex,y[XY]  and 
using  (12.33)  and  (12.3) 


/ex)  roo 

/ 

OO  J  — 


xy 


OO  J  — OO 


27Ty/l  —  p2 


exp 


2(1  —  p2) 


( x 2  -  2pxy  +  y2) 


dx  dy. 


To  evaluate  this  double  integral  we  use  iterated  integrals  and  complete  the  square 
in  the  exponent  of  the  exponential  as  was  previously  done  in  Example  12.2.  This 
results  in 

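\[
Q = y^2 - 2\rho xy + x^2 = (y - \rho x)^2 + (1 - \rho^2)x^2
\]

and produces

\[
\begin{aligned}
\mathrm{cov}(X,Y) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\,\frac{1}{2\pi\sqrt{1-\rho^2}}
\exp\left[-\frac{1}{2(1-\rho^2)}(y - \rho x)^2\right]\exp\left[-\frac{1}{2}x^2\right] dx\,dy \\
&= \int_{-\infty}^{\infty} x\,\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}x^2\right]
\int_{-\infty}^{\infty} y\,\frac{1}{\sqrt{2\pi(1-\rho^2)}}
\exp\left[-\frac{1}{2(1-\rho^2)}(y - \rho x)^2\right] dy\,dx.
\end{aligned}
\]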


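The inner integral over y is just $E_Y[Y] = \int_{-\infty}^{\infty} y\,p_Y(y)\,dy$, where $Y \sim \mathcal{N}(\rho x, 1-\rho^2)$.
Thus, $E_Y[Y] = \rho x$ so that

\[
\begin{aligned}
\mathrm{cov}(X,Y) &= \int_{-\infty}^{\infty} \rho x^2\,\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}x^2\right] dx \\
&= \rho E_X[X^2]
\end{aligned}
\]

where $X \sim \mathcal{N}(0,1)$. But $E_X[X^2] = \mathrm{var}(X) + E_X^2[X] = 1 + 0^2 = 1$ and therefore we
have finally that

\[
\mathrm{cov}(X,Y) = \rho.
\]

❖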
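The result cov(X, Y) = ρ can be checked by simulation. The following is only a sketch under stated assumptions: the value ρ = 0.9, the number of realizations M, and the seed are illustrative choices, and the realizations are generated with the construction that appears later in the chapter as (12.53), i.e., Y is formed from two independent N(0,1) random variables.

```matlab
% Sketch: Monte Carlo check that cov(X,Y) = rho for the standard
% bivariate Gaussian PDF (rho, M and the seed are assumed values)
rng(0);                          % fix the seed so the run is repeatable
rho = 0.9;                       % illustrative correlation coefficient
M = 100000;                      % number of realizations
u = randn(M,1); v = randn(M,1);  % independent N(0,1) realizations
x = u;                           % X ~ N(0,1)
y = rho*u + sqrt(1-rho^2)*v;     % Y ~ N(0,1) with cov(X,Y) = rho
covest = mean(x.*y) - mean(x)*mean(y);  % sample version of (12.32)
disp(covest)                     % should be close to rho
```

The estimate fluctuates from run to run, with a spread that shrinks roughly as 1/sqrt(M).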

With the result of the previous example we can now determine the correlation
coefficient between X and Y for the standard bivariate Gaussian PDF. Since the
marginal PDFs are $X \sim \mathcal{N}(0,1)$ and $Y \sim \mathcal{N}(0,1)$, the correlation coefficient between
X and Y is

\[
\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)\mathrm{var}(Y)}} = \frac{\rho}{\sqrt{1\cdot 1}} = \rho.
\]


We have therefore established that in the standard bivariate Gaussian PDF, the
parameter $\rho$ is the correlation coefficient. This explains the orientation of the constant
PDF contours shown in Figure 12.9. Also, we can now assert that if the correlation
coefficient between X and Y is zero, i.e., $\rho = 0$, and X and Y are jointly Gaussian
distributed (i.e., a standard bivariate Gaussian PDF), then

\[
\begin{aligned}
p_{X,Y}(x,y) &= \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\left[-\frac{1}{2(1-\rho^2)}\left(x^2 - 2\rho xy + y^2\right)\right] \\
&= \frac{1}{2\pi}\exp\left[-\frac{1}{2}\left(x^2 + y^2\right)\right] \\
&= \underbrace{\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}x^2\right]}_{p_X(x)}\cdot
\underbrace{\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}y^2\right]}_{p_Y(y)}
\end{aligned}
\]

and X and Y are independent. This also holds for the general bivariate Gaussian
PDF in which the marginal PDFs are $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$. This
result provides a partial converse to the theorem that if X and Y are independent,
then the random variables are uncorrelated, but only for this particular joint PDF.

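Finally, since $\rho = \rho_{X,Y}$ we have from (12.21) upon replacing W by X and Z by
Y, that

\[
\mathbf{C} = \begin{bmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_Y\sigma_X & \sigma_Y^2 \end{bmatrix}
= \begin{bmatrix} \sigma_X^2 & \rho_{X,Y}\sigma_X\sigma_Y \\ \rho_{X,Y}\sigma_Y\sigma_X & \sigma_Y^2 \end{bmatrix}
= \begin{bmatrix} \mathrm{var}(X) & \mathrm{cov}(X,Y) \\ \mathrm{cov}(Y,X) & \mathrm{var}(Y) \end{bmatrix} \tag{12.34}
\]

is the covariance matrix. We have now established that C as given by (12.34) is
actually the covariance matrix. Hence, the general bivariate Gaussian PDF is given
in succinct form as (see (12.24))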


\[
p_{X,Y}(x,y) = \frac{1}{2\pi\det^{1/2}(\mathbf{C})}
\exp\left(-\frac{1}{2}\begin{bmatrix} x - \mu_X \\ y - \mu_Y \end{bmatrix}^T
\mathbf{C}^{-1}\begin{bmatrix} x - \mu_X \\ y - \mu_Y \end{bmatrix}\right) \tag{12.35}
\]

where C is given by (12.34) and is the covariance matrix (see also Section 9.5)

\[
\mathbf{C} = \begin{bmatrix} \mathrm{var}(X) & \mathrm{cov}(X,Y) \\ \mathrm{cov}(Y,X) & \mathrm{var}(Y) \end{bmatrix}. \tag{12.36}
\]

As previously mentioned, an extremely important property of the bivariate Gaussian
PDF is that uncorrelated random variables are also independent random variables.
Hence, if the covariance matrix in (12.36) is diagonal, then X and Y are independent.
We have shown in Chapter 9 that it is always possible to diagonalize a
covariance matrix by transforming the random vector using a linear transformation.
Specifically, if the random vector $[X\;Y]^T$ is transformed to a new random
vector $\mathbf{V}^T[X\;Y]^T$, where V is the modal matrix for the covariance matrix C, then
the transformed random vector will have a diagonal covariance matrix. Hence, the
transformed random vector will have uncorrelated components. If, furthermore, the
transformed random vector also has a bivariate Gaussian PDF, then its component
random variables will be independent. It is indeed fortunate that this is true: a
linearly transformed bivariate Gaussian random vector produces another bivariate
Gaussian random vector, as we now show. To do so it is more convenient to use a
vector/matrix representation of the PDF. Let the linear transformation be

\[
\begin{bmatrix} W \\ Z \end{bmatrix} = \mathbf{G}\begin{bmatrix} X \\ Y \end{bmatrix}
\]

where G is an invertible 2 × 2 matrix. Assume for simplicity that $\mu_X = \mu_Y = 0$.
Then, from (12.35)

\[
p_{X,Y}(x,y) = \frac{1}{2\pi\det^{1/2}(\mathbf{C})}
\exp\left(-\frac{1}{2}\begin{bmatrix} x \\ y \end{bmatrix}^T\mathbf{C}^{-1}\begin{bmatrix} x \\ y \end{bmatrix}\right)
\]

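and using (12.18)

\[
\begin{aligned}
p_{W,Z}(w,z) &= p_{X,Y}\left(\mathbf{G}^{-1}\begin{bmatrix} w \\ z \end{bmatrix}\right)\left|\det(\mathbf{G}^{-1})\right| \\
&= \frac{1}{2\pi\det^{1/2}(\mathbf{C})}
\exp\left(-\frac{1}{2}\begin{bmatrix} w \\ z \end{bmatrix}^T\mathbf{G}^{-1^T}\mathbf{C}^{-1}\mathbf{G}^{-1}\begin{bmatrix} w \\ z \end{bmatrix}\right)\left|\det(\mathbf{G}^{-1})\right|.
\end{aligned}
\]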
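But it can be shown that (see Section C.3 of Appendix C for matrix inverse and
determinant formulas)

\[
\mathbf{G}^{-1^T}\mathbf{C}^{-1}\mathbf{G}^{-1} = \mathbf{G}^{T^{-1}}\mathbf{C}^{-1}\mathbf{G}^{-1} = (\mathbf{G}\mathbf{C}\mathbf{G}^T)^{-1}
\]

and

\[
\left|\det(\mathbf{G}^{-1})\right| = \frac{1}{(\det(\mathbf{G})\det(\mathbf{G}))^{1/2}} = \frac{1}{(\det(\mathbf{G})\det(\mathbf{G}^T))^{1/2}}
\]

so that

\[
\begin{aligned}
\frac{\left|\det(\mathbf{G}^{-1})\right|}{\det^{1/2}(\mathbf{C})}
&= \frac{1}{\det^{1/2}(\mathbf{C})\,(\det(\mathbf{G})\det(\mathbf{G}^T))^{1/2}} \\
&= \frac{1}{(\det(\mathbf{C})\det(\mathbf{G})\det(\mathbf{G}^T))^{1/2}} \\
&= \frac{1}{(\det(\mathbf{G})\det(\mathbf{C})\det(\mathbf{G}^T))^{1/2}} \\
&= \frac{1}{\det^{1/2}(\mathbf{G}\mathbf{C}\mathbf{G}^T)}.
\end{aligned}
\]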

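Thus, we have finally that the PDF of the linearly transformed random vector is

\[
p_{W,Z}(w,z) = \frac{1}{2\pi\det^{1/2}(\mathbf{G}\mathbf{C}\mathbf{G}^T)}
\exp\left(-\frac{1}{2}\begin{bmatrix} w \\ z \end{bmatrix}^T(\mathbf{G}\mathbf{C}\mathbf{G}^T)^{-1}\begin{bmatrix} w \\ z \end{bmatrix}\right)
\]

which is recognized as a bivariate Gaussian PDF with zero means and a covariance
matrix $\mathbf{G}\mathbf{C}\mathbf{G}^T$. This also agrees with Property 9.4. We summarize our results in a
theorem.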


Theorem  12.7.1  (Linear  transformation  of  Gaussian  random  variables) 

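If (X, Y) has the bivariate Gaussian PDF

\[
p_{X,Y}(x,y) = \frac{1}{2\pi\det^{1/2}(\mathbf{C})}
\exp\left(-\frac{1}{2}\begin{bmatrix} x - \mu_X \\ y - \mu_Y \end{bmatrix}^T
\mathbf{C}^{-1}\begin{bmatrix} x - \mu_X \\ y - \mu_Y \end{bmatrix}\right) \tag{12.37}
\]

and the random vector is linearly transformed as

\[
\begin{bmatrix} W \\ Z \end{bmatrix} = \mathbf{G}\begin{bmatrix} X \\ Y \end{bmatrix}
\]

where G is invertible, then

\[
p_{W,Z}(w,z) = \frac{1}{2\pi\det^{1/2}(\mathbf{G}\mathbf{C}\mathbf{G}^T)}
\exp\left(-\frac{1}{2}\begin{bmatrix} w - \mu_W \\ z - \mu_Z \end{bmatrix}^T
(\mathbf{G}\mathbf{C}\mathbf{G}^T)^{-1}\begin{bmatrix} w - \mu_W \\ z - \mu_Z \end{bmatrix}\right)
\]

where

\[
\begin{bmatrix} \mu_W \\ \mu_Z \end{bmatrix} = \mathbf{G}\begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}
\]

is the transformed mean vector.

The bivariate Gaussian PDF with mean vector $\boldsymbol{\mu}$ and covariance matrix C is denoted
by $\mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$. Hence, the theorem may be paraphrased as follows: if
$[X\;Y]^T \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$, then $\mathbf{G}[X\;Y]^T \sim \mathcal{N}(\mathbf{G}\boldsymbol{\mu}, \mathbf{G}\mathbf{C}\mathbf{G}^T)$. An example, which uses results from
Example 9.4, is given next.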
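Before turning to the example, the two facts used in the derivation can be checked numerically: the determinant identity, and the covariance matrix $\mathbf{G}\mathbf{C}\mathbf{G}^T$ of the transformed vector. The sketch below rests on assumptions: the matrices C and G are hypothetical illustrative values, and the realizations of $[X\;Y]^T$ are generated with MATLAB's `chol`, which is one possible generator and not part of the derivation above.

```matlab
% Sketch: numerical check of Theorem 12.7.1 for assumed C and G
rng(0);                       % fix the seed so the run is repeatable
C = [26 6; 6 26];             % illustrative covariance matrix
G = [1 2; 0 1];               % hypothetical invertible transformation
M = 200000;                   % number of realizations
L = chol(C,'lower');          % C = L*L'
XY = L*randn(2,M);            % zero mean Gaussian vectors with covariance C
WZ = G*XY;                    % linearly transformed vectors
Cest = (WZ*WZ')/M;            % sample covariance (the means are zero)
Ctheory = G*C*G';             % covariance predicted by the theorem
lhs = abs(det(inv(G)))/sqrt(det(C));   % |det(G^-1)|/det^(1/2)(C)
rhs = 1/sqrt(det(G*C*G'));             % 1/det^(1/2)(G*C*G')
disp(Cest); disp(Ctheory); disp([lhs rhs]);
```

The determinant identity holds to machine precision, while `Cest` agrees with `Ctheory` only to within the sampling error of the simulation.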


Example  12.14  -  Transforming  correlated  Gaussian  random  variables  to 
independent  Gaussian  random  variables 


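Let $\mu_X = \mu_Y = 0$ and

\[
\mathbf{C} = \begin{bmatrix} 26 & 6 \\ 6 & 26 \end{bmatrix}
\]

in (12.37). The joint PDF and its constant PDF contours are shown in Figure 12.18.
Now transform X and Y according to

\[
\begin{bmatrix} W \\ Z \end{bmatrix} = \mathbf{G}\begin{bmatrix} X \\ Y \end{bmatrix}
\]

where G is the transpose of the modal matrix V, which is given in Example 9.4.
Therefore

\[
\mathbf{G} = \mathbf{V}^T = \begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix}
\]

so that

\[
\begin{aligned}
W &= \frac{1}{\sqrt{2}}X - \frac{1}{\sqrt{2}}Y \\
Z &= \frac{1}{\sqrt{2}}X + \frac{1}{\sqrt{2}}Y.
\end{aligned}
\]

We have that

\[
\mathbf{G}\mathbf{C}\mathbf{G}^T = \mathbf{V}^T\mathbf{C}\mathbf{V} = \begin{bmatrix} 20 & 0 \\ 0 & 32 \end{bmatrix}
\]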
Figure 12.18: Example of joint PDF for correlated Gaussian random variables.
(a) Joint PDF. (b) Contours of constant PDF.

Figure 12.19: Example of joint PDF for transformed correlated Gaussian random
variables. The random variables are now uncorrelated and hence independent.
(a) Joint PDF. (b) Contours of constant PDF.


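\[
\begin{aligned}
p_{W,Z}(w,z) &= \frac{1}{2\pi\sqrt{20\cdot 32}}
\exp\left(-\frac{1}{2}\begin{bmatrix} w \\ z \end{bmatrix}^T
\begin{bmatrix} 1/20 & 0 \\ 0 & 1/32 \end{bmatrix}\begin{bmatrix} w \\ z \end{bmatrix}\right) \\
&= \frac{1}{\sqrt{2\pi\cdot 20}}\exp\left[-\frac{1}{2}\frac{w^2}{20}\right]\cdot
\frac{1}{\sqrt{2\pi\cdot 32}}\exp\left[-\frac{1}{2}\frac{z^2}{32}\right]
\end{aligned}
\]

which is the factorization of the joint PDF of W and Z into the marginal PDFs
$W \sim \mathcal{N}(0, 20)$ and $Z \sim \mathcal{N}(0, 32)$. Hence, W and Z are now independent random
variables, each with a marginal Gaussian PDF. The joint PDF $p_{W,Z}$ is shown in
Figure 12.19. Note the rotation of the contour plots in Figures 12.18b and 12.19b.
This rotation was asserted in Example 9.4 (see also Problem 12.48).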

❖ 
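The diagonalization in the example can be reproduced numerically. The following sketch assumes that the modal matrix is obtained from MATLAB's `eig`; the eigenvectors it returns are determined only up to sign, so the computed V may differ in sign from the one in Example 9.4 without affecting the diagonal result.

```matlab
% Sketch: diagonalize the covariance matrix of Example 12.14 with eig
C = [26 6; 6 26];                 % covariance matrix of [X Y]'
[V,Lambda] = eig(C);              % columns of V are unit-norm eigenvectors
[lam,idx] = sort(diag(Lambda));   % order the eigenvalues as 20, 32
V = V(:,idx);                     % reorder the eigenvectors to match
G = V';                           % transformation matrix G = V'
D = G*C*G';                       % covariance matrix of [W Z]'
disp(lam'); disp(D);              % D should be diag(20,32)
```

Because `D` is diagonal and the transformed vector is again bivariate Gaussian, W and Z are independent, in agreement with the factorization above.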


12.8  Joint  Moments 

For jointly distributed continuous random variables the k-lth joint moments are
defined as $E_{X,Y}[X^kY^l]$. They are evaluated as

\[
E_{X,Y}[X^kY^l] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^ky^l\,p_{X,Y}(x,y)\,dx\,dy. \tag{12.38}
\]

An example for k = l = 1 and for a standard bivariate Gaussian PDF of $E_{X,Y}[XY]$
was given in Example 12.13. The k-lth joint central moments are defined as
$E_{X,Y}[(X - E_X[X])^k(Y - E_Y[Y])^l]$ and are evaluated as

\[
E_{X,Y}[(X - E_X[X])^k(Y - E_Y[Y])^l] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - E_X[X])^k(y - E_Y[Y])^l\,p_{X,Y}(x,y)\,dx\,dy. \tag{12.39}
\]

Of course, the most important case is for k = l = 1, for which we have cov(X, Y).
For independent random variables the joint moments factor as

\[
E_{X,Y}[X^kY^l] = E_X[X^k]E_Y[Y^l]
\]

and similarly for the joint central moments.
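The factorization of the joint moments for independent random variables is easily illustrated by simulation. The sketch below assumes independent N(0,1) random variables and the illustrative choice k = l = 2, for which the true joint moment is $E_X[X^2]E_Y[Y^2] = 1$; the sample size and seed are also assumptions.

```matlab
% Sketch: estimate a joint moment of independent N(0,1) random variables
% and compare it with the product of the individual moment estimates
rng(0);                           % fix the seed so the run is repeatable
M = 200000;                       % number of realizations (assumed)
k = 2; l = 2;                     % illustrative moment orders
x = randn(M,1); y = randn(M,1);   % independent N(0,1) realizations
joint = mean((x.^k).*(y.^l));     % estimate of E[X^k Y^l]
prodm = mean(x.^k)*mean(y.^l);    % estimate of E[X^k]E[Y^l]
disp([joint prodm])               % both should be close to 1
```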


12.9  Prediction  of  Random  Variable  Outcome 


In  Section  7.9  we  described  the  prediction  of  the  outcome  of  a  discrete  random 
variable  based  on  the  observed  outcome  of  another  discrete  random  variable.  We  now 
examine  the  prediction  problem  for  jointly  distributed  continuous  random  variables, 
and  in  particular,  for  the  case  of  a  bivariate  Gaussian  PDF.  First  we  plot  a  scatter 
diagram  of  the  outcomes  of  the  random  vector  [X  Y]T  in  the  x-y  plane.  Shown  in 
Figure  12.20  is  the  result  for  a  random  vector  with  a  zero  mean  and  a  covariance 
matrix 


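\[
\mathbf{C} = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix}. \tag{12.40}
\]

Note that the correlation coefficient is given by

\[
\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)\mathrm{var}(Y)}} = \frac{0.9}{\sqrt{1\cdot 1}} = 0.9.
\]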
Figure 12.20: 100 outcomes of bivariate Gaussian random vector with zero means
and covariance matrix given by (12.40). The best prediction of Y when X = x is
observed is given by the line.

It is seen from Figure 12.20 that knowledge of X should allow us to predict the
outcome of Y with some accuracy. To do so we adopt as our error criterion the
minimum mean square error (MSE) and use a linear predictor $\hat{Y} = aX + b$. From
Section 7.9 the best linear prediction when X = x is observed is

\[
\hat{Y} = E_Y[Y] + \frac{\mathrm{cov}(X,Y)}{\mathrm{var}(X)}\left(x - E_X[X]\right). \tag{12.41}
\]

For this example the best linear prediction is

\[
\hat{Y} = 0 + \frac{0.9}{1}(x - 0) = 0.9x \tag{12.42}
\]

and is shown as the line in Figure 12.20. Note that the error $\epsilon = Y - 0.9X$ is also
a random variable and can be shown to have the PDF $\epsilon \sim \mathcal{N}(0, 0.19)$ (see Problem
12.49). Finally, note that the predictor, which was constrained to be linear (actually
affine but the use of the term linear is commonplace), cannot be improved upon by
resorting to a nonlinear predictor. This is because it can be shown that the optimal
predictor, among all predictors, is linear if (X, Y) has a bivariate Gaussian PDF
(see Section 13.6). Hence, in this case the prediction of (12.42) is optimal among all
linear and nonlinear predictors.
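The error variance of 0.19 follows from $\mathrm{var}(Y - 0.9X) = 1 + 0.9^2 - 2(0.9)(0.9)$ and can also be observed in a simulation. The sketch below is an illustration under assumptions: the realizations are generated with MATLAB's `chol`, and the competing slopes 0.8 and 1.0 are arbitrary values chosen only for comparison.

```matlab
% Sketch: mean square error of the predictor (12.42) and of two
% arbitrarily chosen competing slopes
rng(0);                         % fix the seed so the run is repeatable
M = 100000;                     % number of realizations (assumed)
C = [1 0.9; 0.9 1];             % covariance matrix (12.40)
G = chol(C,'lower');            % C = G*G'
XY = G*randn(2,M);              % zero mean bivariate Gaussian outcomes
x = XY(1,:); y = XY(2,:);
mse = mean((y - 0.9*x).^2);     % predictor of (12.42), about 0.19
mse_lo = mean((y - 0.8*x).^2);  % competing slope, about 0.20
mse_hi = mean((y - 1.0*x).^2);  % competing slope, about 0.20
disp([mse mse_lo mse_hi])
```

For a slope a the theoretical MSE is $1 + a^2 - 1.8a$, which is minimized at a = 0.9.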
12.10 Joint Characteristic Functions

The joint characteristic function for two jointly continuous random variables X and
Y is defined as

\[
\phi_{X,Y}(\omega_X, \omega_Y) = E_{X,Y}\left[\exp\left[j(\omega_X X + \omega_Y Y)\right]\right]. \tag{12.43}
\]

It is evaluated as

\[
\phi_{X,Y}(\omega_X, \omega_Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p_{X,Y}(x,y)\exp\left[j(\omega_X x + \omega_Y y)\right] dx\,dy \tag{12.44}
\]

and is seen to be the two-dimensional Fourier transform of the PDF (with a +j
instead of the more common −j in the exponential). As in the case of discrete
random variables, the joint moments can be found from the characteristic function
using the formula

\[
E_{X,Y}[X^kY^l] = \left.\frac{1}{j^{k+l}}\frac{\partial^{k+l}\phi_{X,Y}(\omega_X, \omega_Y)}{\partial\omega_X^k\,\partial\omega_Y^l}\right|_{\omega_X = \omega_Y = 0}. \tag{12.45}
\]

Another important application is in determining the PDF for the sum of two
independent continuous random variables. As shown in Section 7.10 for discrete random
variables and also true for jointly continuous random variables, if X and Y are
independent, then the characteristic function of the sum Z = X + Y is

\[
\phi_Z(\omega) = \phi_X(\omega)\phi_Y(\omega). \tag{12.46}
\]

If we were to take the inverse Fourier transform of both sides of (12.46), then the
PDF of X + Y would result. Hence, the procedure to determine the PDF of X + Y,
where X and Y are independent random variables, is

1. Find the characteristic function $\phi_X(\omega)$ by evaluating the Fourier transform
$\int_{-\infty}^{\infty} p_X(x)\exp(j\omega x)\,dx$ and similarly for $\phi_Y(\omega)$.

2. Multiply $\phi_X(\omega)$ and $\phi_Y(\omega)$ together to form $\phi_X(\omega)\phi_Y(\omega)$.

3. Finally, find the inverse Fourier transform to yield the PDF for the sum
Z = X + Y as

\[
p_Z(z) = \int_{-\infty}^{\infty} \phi_X(\omega)\phi_Y(\omega)\exp(-j\omega z)\,\frac{d\omega}{2\pi}. \tag{12.47}
\]

Alternatively,  one  could  convolve  the  PDFs  of  X  and  Y  using  the  convolution  integral 
of  (12.14)  to  yield  the  PDF  of  Z.  However,  the  convolution  approach  is  seldom  easier. 
An  example  follows. 

Example  12.15  -  PDF  for  sum  of  independent  Gaussian  random  variables 

If $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ and X and Y are independent, we wish
to determine the PDF of Z = X + Y. A convolution approach is explored in
Problem 12.51. Here we use (12.47) to accomplish the same task. First we need the
characteristic function of a Gaussian PDF. From Table 11.1 if $X \sim \mathcal{N}(\mu, \sigma^2)$, then

\[
\phi_X(\omega) = \exp\left(j\omega\mu - \frac{1}{2}\sigma^2\omega^2\right).
\]

Thus, the characteristic function for X + Y is

\[
\begin{aligned}
\phi_{X+Y}(\omega) &= \exp\left(j\omega\mu_X - \frac{1}{2}\sigma_X^2\omega^2\right)\exp\left(j\omega\mu_Y - \frac{1}{2}\sigma_Y^2\omega^2\right) \\
&= \exp\left(j\omega(\mu_X + \mu_Y) - \frac{1}{2}(\sigma_X^2 + \sigma_Y^2)\omega^2\right).
\end{aligned}
\]

Since this is again the characteristic function of a Gaussian random variable, albeit
with different parameters, we have that $X + Y \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$. (Recognizing
that the characteristic function is that of a known PDF allows us to avoid inverting
the characteristic function according to (12.47).) Hence, the sum of
independent Gaussian random variables is again a Gaussian random variable whose
mean is $\mu = \mu_X + \mu_Y$ and whose variance is $\sigma^2 = \sigma_X^2 + \sigma_Y^2$. The Gaussian PDF
is therefore called a reproducing PDF. By the same argument it follows that the
sum of any number of independent Gaussian random variables is again a Gaussian
random variable with mean equal to the sum of the means and variance equal to the
sum of the variances. In Problem 12.53 it is shown that the Gamma PDF is also a
reproducing PDF.

❖

The result of the previous example could also be obtained by appealing to Theorem
12.7.1. If we let

\[
\begin{bmatrix} W \\ Z \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} X \\ Y \end{bmatrix}
\]

then by Theorem 12.7.1, W and Z = X + Y are bivariate Gaussian distributed. Also,
we know that the marginals of a bivariate Gaussian PDF are Gaussian PDFs and
therefore the PDF of Z = X + Y is Gaussian. Its mean is $\mu_X + \mu_Y$ and its variance
is $\sigma_X^2 + \sigma_Y^2$, the latter because X and Y are independent and hence uncorrelated.
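When the product of characteristic functions is not recognizable, the inversion in (12.47) can be carried out numerically. The sketch below illustrates this for the Gaussian case, where the answer is known and can be used as a check. The means, variances, frequency grid, and the use of trapezoidal integration (`trapz`) are all assumptions made for the illustration.

```matlab
% Sketch: numerical evaluation of (12.47) for the sum of two
% independent Gaussian random variables (all parameters assumed)
muX = 1; varX = 2; muY = -0.5; varY = 1;  % hypothetical means and variances
w = linspace(-20,20,4001);                % frequency grid, assumed wide enough
phiX = exp(1j*w*muX - 0.5*varX*w.^2);     % characteristic function of X
phiY = exp(1j*w*muY - 0.5*varY*w.^2);     % characteristic function of Y
z = linspace(-6,7,131);                   % points at which to evaluate p_Z(z)
pZ = zeros(size(z));
for i = 1:length(z)
    % inverse Fourier transform of phiX*phiY by trapezoidal integration
    pZ(i) = real(trapz(w, phiX.*phiY.*exp(-1j*w*z(i))))/(2*pi);
end
% known answer: Gaussian PDF with summed means and variances
pTrue = exp(-0.5*(z-(muX+muY)).^2/(varX+varY))/sqrt(2*pi*(varX+varY));
maxerr = max(abs(pZ - pTrue));
disp(maxerr)                              % should be very small
```

For a PDF whose characteristic function decays slowly, a wider and finer frequency grid would be needed.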


12.11  Computer  Simulation  of  Jointly  Continuous 
Random  Variables 

For  an  arbitrary  joint  PDF  the  generation  of  continuous  random  variables  is  most 
easily  done  using  ideas  from  conditional  PDF  theory.  In  Chapter  13  we  will  see 
how  this  is  done.  Here  we  will  consider  only  the  generation  of  a  bivariate  Gaussian 
random  vector.  The  approach  is  based  on  the  following  properties: 




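1. Any affine transformation of two jointly Gaussian random variables results in
two new jointly Gaussian random variables. A special case, the linear
transformation, was proven in Section 12.7 and the general result summarized in
Theorem 12.7.1. We will now consider the affine transformation

\[
\begin{bmatrix} W \\ Z \end{bmatrix} = \mathbf{G}\begin{bmatrix} X \\ Y \end{bmatrix} + \begin{bmatrix} a \\ b \end{bmatrix}. \tag{12.48}
\]

2. The mean vector and covariance matrix of $[W\;Z]^T$ transform according to

\[
E_{W,Z}\left[\begin{bmatrix} W \\ Z \end{bmatrix}\right] = \mathbf{G}\,E_{X,Y}\left[\begin{bmatrix} X \\ Y \end{bmatrix}\right] + \begin{bmatrix} a \\ b \end{bmatrix} \quad \text{(see Problem 9.22)} \tag{12.49}
\]

\[
\mathbf{C}_{W,Z} = \mathbf{G}\mathbf{C}_{X,Y}\mathbf{G}^T \quad \text{(see Theorem 12.7.1)} \tag{12.50}
\]

where we now use subscripts on the covariance matrices to indicate the random
variables.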


The approach to be described next assumes that X and Y are standard Gaussian and
independent random variables whose realizations are easily generated. In MATLAB
the command randn(1,1) can be used. Otherwise, if only $\mathcal{U}(0,1)$ random variables
are available, one can use the Box-Muller transform to obtain X and Y (see Problem
12.54). Then, to obtain any bivariate Gaussian random variables (W, Z) with a given
mean $[\mu_W\;\mu_Z]^T$ and covariance matrix $\mathbf{C}_{W,Z}$, we use (12.48) with a suitable G and
$[a\;b]^T$ so that

\[
\mathbf{C}_{W,Z} = \begin{bmatrix} \sigma_W^2 & \rho\sigma_W\sigma_Z \\ \rho\sigma_Z\sigma_W & \sigma_Z^2 \end{bmatrix}. \tag{12.51}
\]

Since it is assumed that X and Y are zero mean, from (12.49) we choose $a = \mu_W$
and $b = \mu_Z$. Also, since X and Y are assumed independent, hence uncorrelated,
and with unit variances, we have

\[
\mathbf{C}_{X,Y} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.
\]

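It follows from (12.50) that $\mathbf{C}_{W,Z} = \mathbf{G}\mathbf{G}^T$. To find G if we are given $\mathbf{C}_{W,Z}$, we could
use an eigendecomposition approach based on the relationship $\mathbf{V}^T\mathbf{C}_{W,Z}\mathbf{V} = \boldsymbol{\Lambda}$ (see
Problem 12.55). Instead, we next explore an alternative approach which is somewhat
easier to implement in practice. Let G be a lower triangular matrix

\[
\mathbf{G} = \begin{bmatrix} a & 0 \\ b & c \end{bmatrix}.
\]

Then, we have that

\[
\mathbf{G}\mathbf{G}^T = \begin{bmatrix} a & 0 \\ b & c \end{bmatrix}\begin{bmatrix} a & b \\ 0 & c \end{bmatrix}
= \begin{bmatrix} a^2 & ab \\ ab & b^2 + c^2 \end{bmatrix}. \tag{12.52}
\]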


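The numerical procedure of decomposing a covariance matrix into a product such as
$\mathbf{G}\mathbf{G}^T$, where G is lower triangular, is called the Cholesky decomposition [Golub and
Van Loan 1983]. Here we can do so almost by inspection. We need only equate the
elements of $\mathbf{C}_{W,Z}$ in (12.51) to those of $\mathbf{G}\mathbf{G}^T$ as given in (12.52). Doing so produces
the result

\[
\begin{aligned}
a &= \sigma_W \\
b &= \rho\sigma_Z \\
c &= \sigma_Z\sqrt{1 - \rho^2}.
\end{aligned}
\]

Hence, we have that

\[
\mathbf{G} = \begin{bmatrix} \sigma_W & 0 \\ \rho\sigma_Z & \sigma_Z\sqrt{1 - \rho^2} \end{bmatrix}.
\]

In summary, to generate a realization of a bivariate Gaussian random vector we first
generate two independent standard Gaussian random variables X and Y and then
transform according to

\[
\begin{bmatrix} W \\ Z \end{bmatrix} = \begin{bmatrix} \sigma_W & 0 \\ \rho\sigma_Z & \sigma_Z\sqrt{1 - \rho^2} \end{bmatrix}
\begin{bmatrix} X \\ Y \end{bmatrix} + \begin{bmatrix} \mu_W \\ \mu_Z \end{bmatrix}. \tag{12.53}
\]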
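The closed-form G can be compared with a numerically computed Cholesky factor. In the sketch below the values of $\sigma_W$, $\sigma_Z$, and $\rho$ are hypothetical, and MATLAB's `chol` with the `'lower'` option is assumed as the numerical routine.

```matlab
% Sketch: compare the closed-form G of (12.53) with chol
sigW = 2; sigZ = 3; rho = -0.5;          % hypothetical parameters
C = [sigW^2 rho*sigW*sigZ; ...
     rho*sigZ*sigW sigZ^2];              % covariance matrix of (12.51)
Gclosed = [sigW 0; ...
           rho*sigZ sigZ*sqrt(1-rho^2)]; % a, b, c found by inspection
Gchol = chol(C,'lower');                 % numerical Cholesky factor
disp(Gclosed); disp(Gchol);              % the two should agree
disp(Gclosed*Gclosed' - C);              % should be the zero matrix
```

The two factors agree because the lower triangular factor with positive diagonal elements is unique for a positive definite covariance matrix.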


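As an example, we let $\mu_W = \mu_Z = 1$, $\sigma_W = \sigma_Z = 1$, and $\rho = 0.9$. The constant
PDF contours as well as 500 realizations of $[W\;Z]^T$ are shown in Figure 12.21. To
verify that the mean vector and covariance matrix are correct, we can estimate these
quantities using (9.44) and (9.46) which are

\[
\widehat{E_{W,Z}}\left[\begin{bmatrix} W \\ Z \end{bmatrix}\right] = \frac{1}{M}\sum_{m=1}^{M}\begin{bmatrix} w_m \\ z_m \end{bmatrix}
\]

\[
\widehat{\mathbf{C}}_{W,Z} = \frac{1}{M}\sum_{m=1}^{M}
\left(\begin{bmatrix} w_m \\ z_m \end{bmatrix} - \widehat{E_{W,Z}}\left[\begin{bmatrix} W \\ Z \end{bmatrix}\right]\right)
\left(\begin{bmatrix} w_m \\ z_m \end{bmatrix} - \widehat{E_{W,Z}}\left[\begin{bmatrix} W \\ Z \end{bmatrix}\right]\right)^T
\]

where $[w_m\;z_m]^T$ is the mth realization of $[W\;Z]^T$. The results and the true values
for M = 2000 are

\[
\widehat{E_{W,Z}}\left[\begin{bmatrix} W \\ Z \end{bmatrix}\right] = \begin{bmatrix} 1.0326 \\ 1.0252 \end{bmatrix}
\qquad
E_{W,Z}\left[\begin{bmatrix} W \\ Z \end{bmatrix}\right] = \begin{bmatrix} 1 \\ 1 \end{bmatrix}
\]

\[
\widehat{\mathbf{C}}_{W,Z} = \begin{bmatrix} 0.9958 & 0.9077 \\ 0.9077 & 1.0166 \end{bmatrix}
\qquad
\mathbf{C}_{W,Z} = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix}.
\]

The MATLAB code used to generate the realizations and to estimate the mean
vector and covariance matrix is given next.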
Figure 12.21: 500 outcomes of bivariate Gaussian random vector with mean $[1\;1]^T$
and covariance matrix given by (12.40).


    randn('state',0) % set random number generator to initial value
    G=[1 0;0.9 sqrt(1-0.9^2)]; % define G matrix
    M=2000; % set number of realizations
    for m=1:M
      x=randn(1,1);y=randn(1,1); % generate realizations of two
                                 % independent N(0,1) random variables
      wz=G*[x y]'+[1 1]'; % transform to desired mean and covariance
      WZ(:,m)=wz; % save realizations in 2 x M array
    end
    Wmeanest=mean(WZ(1,:)); % estimate mean of W
    Zmeanest=mean(WZ(2,:)); % estimate mean of Z
    WZbar(1,:)=WZ(1,:)-Wmeanest; % subtract out mean of W
    WZbar(2,:)=WZ(2,:)-Zmeanest; % subtract out mean of Z
    Cest=[0 0;0 0];
    for m=1:M
      Cest=Cest+(WZbar(:,m)*WZbar(:,m)')/M; % compute estimate of
                                             % covariance matrix
    end
    Wmeanest % write out estimate of mean of W
    Zmeanest % write out estimate of mean of Z
    Cest % write out estimate of covariance matrix
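The same computation can be written without loops. The following is a sketch of one possible vectorized form, not the code used to produce the results above: it assumes the functions `rng`, `chol`, `repmat`, and `cov` as found in current MATLAB releases, and because the random number generator is seeded differently the numerical estimates will not match the values quoted above exactly.

```matlab
% Sketch: vectorized generation and estimation (assumed alternative)
rng(0);                          % fix the seed so the run is repeatable
mu = [1; 1];                     % desired mean vector
C = [1 0.9; 0.9 1];              % desired covariance matrix (12.40)
M = 2000;                        % number of realizations
G = chol(C,'lower');             % lower triangular G with G*G' = C
WZ = G*randn(2,M) + repmat(mu,1,M);  % apply (12.53) to all realizations
meanest = mean(WZ,2);            % estimate of the mean vector
Cest = cov(WZ',1);               % covariance estimate normalized by M
disp(meanest); disp(Cest);
```

The second argument of 1 in `cov` selects normalization by M rather than M − 1, which matches the estimator used in the loop version.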
12.12 Real-World Example - Optical Character Recognition

An  important  use  of  computers  is  to  be  able  to  scan  a  document  and  automatically 
read  the  characters.  For  example,  bank  checks  are  routinely  scanned  to  ascertain 
the  account  numbers,  which  are  usually  printed  on  the  bottom.  Also,  scanners  are 
used  to  take  a  page  of  alphabetic  characters  and  convert  the  text  to  a  computer 
file  that  can  later  be  edited  in  a  computer.  In  this  section  we  briefly  describe  how 
this  might  be  done.  A  more  comprehensive  description  can  be  found  in  [Trier,  Jain, 
and  Taxt  1996].  To  simplify  the  discussion  we  consider  recognition  of  the  digits 
0, 1, 2, . . . ,  9  that  have  been  generated  by  a  printer  (as  opposed  to  handwritten,  the 
recognition  of  which  is  much  more  complex  due  to  the  potential  variations  of  the 
characters).  An  example  of  these  characters  is  shown  in  Figure  12.22.  They  were 
obtained  by  printing  the  characters  from  a  computer  to  a  laser  printer  and  then 
scanning them back into a computer. Note that each digit consists of an 80 × 80
array of pixels and each pixel is either black or white. This is termed a binary image.
A magnified version of the digit “1” is shown in Figure 12.23, where the “pixelation”
is clearly evident. Also, some of the black pixels have been omitted due to errors in
the scanning process.

Figure 12.22: Scanned digits for optical character recognition.

Figure 12.23: Magnified version of the digit “1”.

In order for a computer to be able to recognize and decode
the digits it is necessary to reduce each 80 × 80 image to a number or set of numbers
that characterize the digit. These numbers are called the features and they compose
a feature vector that must be different for each digit. This will allow a computer to
distinguish between the digits and be less susceptible to noise effects as is evident
in the “1” image. For our example, we will choose only two features, although in
practice many more are used. A typical set of features based on the geometric character of
the digit images is the geometric moments. They attempt to measure the distribution
of the black pixels and are completely analogous to our usual joint moments. (Recall
our motivation of the expected value using the idea of the center of mass of an object
in Section 11.3.) Let g[m, n] denote the pixel value at location [m, n] in the image,
where m = 1, 2, ..., 80, n = 1, 2, ..., 80 and either g[m, n] = 1 for a black pixel or
g[m, n] = 0 for a white pixel. Note from Figure 12.23 that the indices for the [m, n]
pixel are specified in matrix format, where m indicates the row and n indicates the
column. The geometric moments are defined as

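\[
\mu'[k,l] = \frac{\sum_{m=1}^{80}\sum_{n=1}^{80} m^k n^l\,g[m,n]}{\sum_{m=1}^{80}\sum_{n=1}^{80} g[m,n]}.
\]

If we were to define

\[
p[m,n] = \frac{g[m,n]}{\sum_{m=1}^{80}\sum_{n=1}^{80} g[m,n]} \qquad m = 1, 2, \ldots, 80;\; n = 1, 2, \ldots, 80
\]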

then p[m, n] would have the properties of a joint PMF, in that it is nonnegative and
sums to one. A somewhat better feature is obtained by using the central geometric
moments, which will yield the same number even as the digit is translated in the
horizontal and vertical directions. This may be seen to be of value by referring to
Figure 12.22, in which the centers of the digits do not all lie at the same location.
Using central geometric moments alleviates having to center each digit. The definition
is

\[
\mu[k,l] = \frac{\sum_{m=1}^{80}\sum_{n=1}^{80} (m - \bar{m})^k (n - \bar{n})^l\,g[m,n]}{\sum_{m=1}^{80}\sum_{n=1}^{80} g[m,n]} \tag{12.55}
\]

where

\[
\begin{aligned}
\bar{m} &= \mu'[1,0] = \frac{\sum_{m=1}^{80}\sum_{n=1}^{80} m\,g[m,n]}{\sum_{m=1}^{80}\sum_{n=1}^{80} g[m,n]} \\
\bar{n} &= \mu'[0,1] = \frac{\sum_{m=1}^{80}\sum_{n=1}^{80} n\,g[m,n]}{\sum_{m=1}^{80}\sum_{n=1}^{80} g[m,n]}.
\end{aligned}
\]

The coordinate pair $(\bar{m}, \bar{n})$ is the center of mass of the character and is completely
analogous to the mean of the “joint PMF” p[m, n].
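The translation invariance of the central geometric moments can be seen with a small computation. The sketch below uses a hypothetical binary image, a crude vertical bar standing in for a digit, rather than one of the scanned digits, and the helper names `mbar`, `nbar`, and `mu` are introduced only for this illustration.

```matlab
% Sketch: central geometric moments (12.55) of a hypothetical image
g = zeros(80,80);
g(20:60,38:42) = 1;               % crude vertical bar (assumed test image)
gs = circshift(g,[5 -7]);         % same bar shifted 5 rows down, 7 columns left
[m,n] = ndgrid(1:80,1:80);        % m is the row index, n is the column index
mbar = @(g) sum(m(:).*g(:))/sum(g(:));   % center of mass, row coordinate
nbar = @(g) sum(n(:).*g(:))/sum(g(:));   % center of mass, column coordinate
mu = @(g,k,l) sum(((m(:)-mbar(g)).^k).*((n(:)-nbar(g)).^l).*g(:))/sum(g(:));
f1 = [mu(g,1,1) mu(g,2,2)];       % feature vector of the original image
f2 = [mu(gs,1,1) mu(gs,2,2)];     % feature vector of the shifted image
disp(f1); disp(f2);               % the two feature vectors should be identical
```

The noncentral moments $\mu'[k,l]$ would change under the same shift, which is why the central moments are preferred as features.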

To demonstrate the procedure by which optical character recognition is accomplished
we will add noise to the characters. To simulate a “dropout”, in which a
black pixel becomes a white one (see Figure 12.23 for an example), we change each
black pixel to a white one with a probability of 0.4, and make no change with
probability of 0.6. To simulate spurious scanning marks we change each white pixel to a
black one with probability of 0.1, and make no change with probability of 0.9. An
example of the corrupted digits is shown in Figure 12.24. As a feature vector we will
use the pair $(\mu[1,1], \mu[2,2])$. For the digits “1” and “8”, 50 realizations of the feature
vector are shown in Figure 12.25a. The black square indicates the center of mass
$(\bar{m}, \bar{n})$ for each digit’s feature vector. Note that we could distinguish between the
two characters without error if we recognize an outcome as belonging to a “1” if we
are below the line boundary shown and as an “8” otherwise. However, for the digits
“1” and “3” there is an overlap region where the outcomes could belong to either
character as seen in Figure 12.25b. For these digits we could not separate the digits
without a large error. The latter is more typically the case and can only be resolved
by using a larger dimension feature vector. The interested reader should consult
[Duda, Hart, and Stork 2001] for a further discussion of pattern recognition (also
called pattern classification). Also, note that the digits “3” and “8” would produce
outcomes that would overlap greatly. Can you explain why? You might consider
some typical scanned digits as shown in Figure 12.26 that have been designed to
make recognition easier!

Figure 12.24: Realization of corrupted digits.

Figure 12.25: 50 realizations of feature vector for two competing digits.
(a) Digits 1 and 8. (b) Digits 1 and 3.

Figure 12.26: Some scanned digits typically used in optical character recognition.
They were scanned into a computer, which accounts for the obvious errors.
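The noise model and the feature extraction described above can be sketched as follows. The clean image here is again a hypothetical vertical bar, not one of the scanned digits, so the resulting feature values are not those plotted in Figure 12.25; only the dropout probability 0.4, the spurious mark probability 0.1, and the 50 realizations are taken from the text.

```matlab
% Sketch: corrupt a hypothetical binary image and compute the feature
% vector (mu[1,1], mu[2,2]) for 50 noisy realizations
rng(0);                               % fix the seed so the run is repeatable
g = zeros(80,80);
g(20:60,38:42) = 1;                   % assumed clean image (vertical bar)
[m,n] = ndgrid(1:80,1:80);            % row and column indices
R = 50;                               % number of realizations
feat = zeros(R,2);                    % one feature vector per row
for r = 1:R
    u = rand(80,80);                  % one uniform outcome per pixel
    gn = g;
    gn(g==1 & u<0.4) = 0;             % dropout: black to white w.p. 0.4
    gn(g==0 & u<0.1) = 1;             % spurious mark: white to black w.p. 0.1
    N = sum(gn(:));                   % number of black pixels
    mb = sum(m(:).*gn(:))/N;          % center of mass, row coordinate
    nb = sum(n(:).*gn(:))/N;          % center of mass, column coordinate
    feat(r,1) = sum((m(:)-mb).*(n(:)-nb).*gn(:))/N;            % mu[1,1]
    feat(r,2) = sum(((m(:)-mb).^2).*((n(:)-nb).^2).*gn(:))/N;  % mu[2,2]
end
disp(mean(feat))                      % average feature vector over the runs
```

Plotting the rows of `feat` for two different clean images on the same axes would give a scatter diagram of the kind shown in Figure 12.25.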
Problems

12.1 (☺) (w) For the dartboard shown in Figure 12.1 determine the probability
that the novice dart player will land his dart in the outermost ring, which has
radii 3/4 < r < 1. Do this by using geometrical arguments and also using
double integrals. Hint: For the latter approach convert to polar coordinates
$(r, \theta)$ and remember to use $dx\,dy = r\,dr\,d\theta$.

12.2 (c) Reproduce Figure 12.2a by letting $X \sim \mathcal{U}(-1,1)$ and $Y \sim \mathcal{U}(-1,1)$,
where X and Y are independent. Omit any realizations of (X, Y) for which
$\sqrt{X^2 + Y^2} > 1$. Explain why this produces a uniform distribution of points in
the unit circle. See also Problem 13.23 for a more formal justification of this
procedure.

12.3 (☺) (w) For the novice dart player is P[0 < R < 0.5] = 0.5 (R is the distance
from the center of the dartboard)? Explain your results.

12.4 (w) Find the volume of a cylinder of height h and whose base has radius r by
using a double integral evaluation.

12.5 (☺) (c) In this problem we estimate $\pi$ using probability arguments. Let
$X \sim \mathcal{U}(-1,1)$ and $Y \sim \mathcal{U}(-1,1)$ for X and Y independent. First relate
$P[X^2 + Y^2 < 1]$ to the value of $\pi$. Then generate realizations of X and Y and use them to
estimate $\pi$.
12.6 (f) For the joint PDF

\[
p_{X,Y}(x,y) = \begin{cases} \frac{1}{\pi} & x^2 + y^2 < 1 \\ 0 & \text{otherwise} \end{cases}
\]

find P[|X| < 1/2]. Hint: You will need

\[
\int \sqrt{1 - x^2}\,dx = \frac{1}{2}x\sqrt{1 - x^2} + \frac{1}{2}\arcsin(x).
\]

12.7 (☺) (f) If a joint PDF is given by

\[
p_{X,Y}(x,y) = \begin{cases} \frac{c}{\sqrt{xy}} & 0 < x < 1,\; 0 < y < 1 \\ 0 & \text{otherwise} \end{cases}
\]

find c.

12.8 (w) A point is chosen at random from the sample space S = {(x, y) : 0 < x <
1, 0 < y < 1}. Find P[Y < X].

12.9 (f) For the joint PDF $p_{X,Y}(x,y) = \exp[-(x + y)]u(x)u(y)$, find P[Y < X].

12.10 (☺) (w,c) Two persons play a game in which the first person thinks of a
number from 0 to 1, while the second person tries to guess player one’s number.
The second player claims that he is telepathic and knows what number the
first player has chosen. In reality the second player just chooses a number
at random. If player one also thinks of a number at random, what is the
probability that player two will choose a number whose difference from player
one’s number is less than 0.1? Add credibility to your solution by simulating
the game and estimating the desired probability.

12.11 (☺) (f) If (X, Y) has a standard bivariate Gaussian PDF, find $P[X^2 + Y^2 = 10]$.

12.12 (f,c) Plot the values of (x, y) for which $x^2 - 2\rho xy + y^2 = 1$ for $\rho = -0.9$,
$\rho = 0$, and $\rho = 0.9$. Hint: Solve for y in terms of x.

12.13 (w,c) Plot the standard bivariate PDF in three dimensions for $\rho = 0.9$. Next
examine your plot if $\rho \rightarrow 1$ and determine what happens. As $\rho \rightarrow 1$, can you
predict Y based on X = x?

12.14 (f) If $p_{X,Y}(x,y) = \exp[-(x + y)]u(x)u(y)$, determine the marginal PDFs.

12.15 (☺) (f) If

\[
p_{X,Y}(x,y) = \begin{cases} 2 & 0 < x < 1,\; 0 < y < x \\ 0 & \text{otherwise} \end{cases}
\]

find the marginal PDFs.
12.16 (t) Assuming that $(x, y) \neq (0, 0)$, prove that $x^2 - 2\rho xy + y^2 > 0$ for
$-1 < \rho < 1$.

12.17 (f) If $p_X(x) = (1/2)\exp[-(1/2)x]u(x)$ and $p_Y(y) = (1/4)\exp[-(1/4)y]u(y)$,
find the joint PDF of X and Y.

12.18 (☺) (f) Determine the joint CDF if X and Y are independent with

0  <  x  <  2 
otherwise 

0  <  y  <  4 
otherwise. 


12.19 (f) Determine the joint CDF corresponding to the joint PDF

\[
p_{X,Y}(x,y) = \begin{cases} xy\exp\left[-\frac{1}{2}(x^2 + y^2)\right] & x > 0,\; y > 0 \\ 0 & \text{otherwise.} \end{cases}
\]

Next verify Properties 12.1-12.6 for the CDF.

12.20 (t) Prove that (12.10) is true if (12.11) is true and vice versa. Hint: Let
A = {x : a < x < b} and B = {y : c < y < d} for the first part and let
A = {x : $x_0 - \Delta x/2 < x < x_0 + \Delta x/2$} and B = {y : $y_0 - \Delta y/2 < y < y_0 + \Delta y/2$} with
$x_0$ and $y_0$ arbitrary for the second part.

12.21  (t)  Prove  that  (12.11)  and  (12.12)  are  equivalent. 

12.22 (w) Two independent speech signals are added together. If each one has a Laplacian PDF with parameter $\sigma^2$, what is the power of the resultant signal?

12.23  (^)  (w)  Lightbulbs  fail  with  a  time  to  failure  modeled  as  an  exponential 
random  variable  with  a  mean  time  to  failure  of  1000  hours.  If  two  lightbulbs 
are  used  to  illuminate  a  room,  what  is  the  probability  that  both  bulbs  will  fail 
before  2000  hours?  Assume  that  the  failure  time  of  one  bulb  does  not  affect 
the  failure  time  of  the  other  bulb. 


12.24 (f) If a joint PDF is given as $p_{X,Y}(x, y) = 6\exp[-(2x + 3y)]u(x)u(y)$, what is the probability of $A = \{(x, y) : 0 < x < 2,\ 0 < y < 1\}$? Are the two random variables independent?

12.25  (^)  (w)  A  joint  PDF  is  uniform  over  the  region  {(x,  y)  :  0  <  y  <  x,  0  <  x  < 
1}  and  zero  elsewhere.  Are  X  and  Y  independent? 

12.26 (s^/) (w) The temperature in Antarctica is modeled as a random variable $X \sim \mathcal{N}(20, 1500)$ degrees Fahrenheit, while that in Ecuador is modeled also as a random variable with $Y \sim \mathcal{N}(100, 100)$ degrees Fahrenheit. What is the probability that it will be hotter in Antarctica than in Ecuador? Assume the random variables are independent.
12.27 (w,c) In Section 2.3 we discussed the outcomes resulting from adding together two random variables uniform on (0, 1). We claimed that the probability of 500 outcomes in the interval [0, 0.5] and 500 outcomes in the interval [1.5, 2] resulting from a total of 1000 outcomes is

$$\binom{1000}{500}\left(\frac{1}{8}\right)^{1000} \approx 2.2 \times 10^{-604}.$$

Can you now justify this result? What assumptions are implicit in its calculation? Hint: For each trial consider the 3 possible outcomes [0, 0.5), [0.5, 1.5), and [1.5, 2). Also, see Problem 3.48 on how to evaluate expressions with large factorials.
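
One way to evaluate an expression of this size without overflow is to work with logarithms of the factorials. The following is only a sketch, not the method of Problem 3.48: it assumes a per-trial probability of 1/8 for each of the two outer intervals (the area under the triangular PDF of the sum), and the helper names `log10_binomial` and `log10_probability` are hypothetical.

```python
import math

def log10_binomial(n, k):
    """log10 of the binomial coefficient, using log-gamma to avoid huge factorials."""
    ln_value = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return ln_value / math.log(10.0)

def log10_probability(n=1000, k=500, p_low=1.0 / 8.0, p_high=1.0 / 8.0):
    """log10 of P[k outcomes in the low interval, n - k in the high interval].

    Assumes independent trials and no outcomes in the middle interval,
    so the multinomial coefficient reduces to a binomial coefficient.
    """
    return (log10_binomial(n, k)
            + k * math.log10(p_low)
            + (n - k) * math.log10(p_high))

log10_p = log10_probability()
exponent = math.floor(log10_p)
mantissa = 10.0 ** (log10_p - exponent)
print(f"probability is about {mantissa:.2f} x 10^{exponent}")
```

Working in the log domain keeps every intermediate quantity within ordinary floating-point range, which is the main design point here.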


12.28 (f) Find the PDF of $X = U_1 + U_2$, where $U_1 \sim \mathcal{U}(0, 1)$, $U_2 \sim \mathcal{U}(0, 1)$, and $U_1$, $U_2$ are independent. Use a convolution integral to do this.


12.29 (w) In this problem we show that the ratio of areas for the linear transformation

$$\begin{bmatrix} w \\ z \end{bmatrix} = \underbrace{\begin{bmatrix} a & b \\ c & d \end{bmatrix}}_{G} \begin{bmatrix} x \\ y \end{bmatrix}$$

is $|\det(G)|$. To do so let $\xi = [x\ y]^T$ take on values in the region $\{(x, y) : 0 < x < 1,\ 0 < y < 1\}$ as shown by the shaded area in Figure 12.27. Then, consider a point in the unit square to be represented as $\xi = \alpha e_1 + \beta e_2$, where $0 < \alpha < 1$, $0 < \beta < 1$, $e_1 = [1\ 0]^T$, and $e_2 = [0\ 1]^T$. The transformed vector is

$$G\xi = G(\alpha e_1 + \beta e_2) = \alpha G e_1 + \beta G e_2 = \alpha \begin{bmatrix} a \\ c \end{bmatrix} + \beta \begin{bmatrix} b \\ d \end{bmatrix}.$$

It is seen that the natural basis vectors $e_1$, $e_2$ map into the vectors $[a\ c]^T$, $[b\ d]^T$, which appear as shown in Figure 12.27. The region in the w-z plane that results from mapping the unit square is shown as shaded. The area of the parallelogram can be found from Figure 12.28 as $BH$. Determine the ratio of areas to show that

$$\frac{\text{Area in } w\text{-}z \text{ plane}}{\text{Area in } x\text{-}y \text{ plane}} = ad - bc = \det(G).$$

The absolute value is needed since if for example $a < 0$, $b < 0$, the parallelogram will be in the second quadrant and its determinant will be negative. The absolute value sign takes care of all the possible cases.
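
The area claim can also be checked numerically. The sketch below, which is an illustration rather than part of the problem, maps the corners of the unit square through a 2 x 2 matrix and measures the resulting parallelogram with the shoelace formula; the function names and the sample matrix are assumptions chosen for the demonstration.

```python
def parallelogram_area(G):
    """Area of the image of the unit square under the 2x2 matrix G (shoelace formula)."""
    (a, b), (c, d) = G
    # Images of the corners (0,0), (1,0), (1,1), (0,1), taken in order.
    corners = [(0.0, 0.0), (a, c), (a + b, c + d), (b, d)]
    total = 0.0
    for (w1, z1), (w2, z2) in zip(corners, corners[1:] + corners[:1]):
        total += w1 * z2 - w2 * z1
    return abs(total) / 2.0

def det2(G):
    """Determinant ad - bc of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = G
    return a * d - b * c

G_example = [[1.0, 2.0], [3.0, 1.0]]   # sample matrix with a negative determinant
area = parallelogram_area(G_example)
print(area, abs(det2(G_example)))
```

Since the unit square has area 1, the printed area is itself the ratio of areas, and it agrees with the magnitude of the determinant even when the determinant is negative.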
Figure 12.27: Mapping of areas for linear transformation.

Figure 12.28: Geometry to determine area of parallelogram.


12.30 (^) (w,c) The champion dart player described in Section 12.3 is able to land his dart at a point $(x, y)$ according to the joint PDF

$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1/64 & 0 \\ 0 & 1/64 \end{bmatrix} \right)$$

with some outcomes shown in Figure 12.2b. Determine the probability of a bullseye. Next simulate the game and plot the outcomes. Finally estimate the probability of a bullseye using the results of your computer simulation.
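
A minimal simulation sketch for this kind of problem follows. It assumes a bullseye radius of 1/4 (the value quoted for the dartboard in Chapter 13) and zero-mean independent Gaussian coordinates with variance 1/64; the function names are hypothetical and plotting is omitted.

```python
import math
import random

def bullseye_probability_exact(sigma2=1.0 / 64.0, radius=0.25):
    """For zero-mean independent Gaussians with common variance sigma2,
    X^2 + Y^2 is exponential with mean 2*sigma2, which gives this closed form."""
    return 1.0 - math.exp(-radius ** 2 / (2.0 * sigma2))

def bullseye_probability_sim(trials=100_000, sigma2=1.0 / 64.0, radius=0.25, seed=0):
    """Monte Carlo estimate: fraction of simulated darts landing inside the bullseye."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    hits = 0
    for _ in range(trials):
        x = rng.gauss(0.0, sigma)
        y = rng.gauss(0.0, sigma)
        if x * x + y * y <= radius * radius:
            hits += 1
    return hits / trials

exact = bullseye_probability_exact()
estimate = bullseye_probability_sim()
print(f"exact {exact:.4f}, simulated {estimate:.4f}")
```

The fixed seed only makes the run repeatable; any seed should give an estimate close to the closed-form value.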
12.31 (t) Show that (12.19) can be written as (12.20).


12.32 (t) Consider the nonlinear transformation $w = g(x, y)$, $z = h(x, y)$. Use a tangent approximation to both functions about the point $(x_0, y_0)$ to express $[w\ z]^T$ as an approximate affine function of $[x\ y]^T$, and use matrix/vector notation. For example,

$$w = g(x, y) \approx g(x_0, y_0) + \left.\frac{\partial g}{\partial x}\right|_{x = x_0,\, y = y_0} (x - x_0) + \left.\frac{\partial g}{\partial y}\right|_{x = x_0,\, y = y_0} (y - y_0)$$

and similarly for $z = h(x, y)$. Compare the matrix to the Jacobian matrix of (12.23).

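12.33 (f) If a joint PDF is given as $p_{X,Y}(x, y) = (1/4)^2 \exp[-\frac{1}{2}(|x| + |y|)]$ for $-\infty < x < \infty$, $-\infty < y < \infty$, find the joint PDF of

$$\begin{bmatrix} W \\ Z \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \end{bmatrix}.$$

12.34 (f) If a joint PDF is given as $p_{X,Y}(x, y) = \exp[-(x + y)]u(x)u(y)$, find the joint PDF of $W = XY$, $Z = Y/X$.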

12.35 (w,c) Consider the nonlinear transformation

$$w = x^2 + 5y^2$$
$$z = -5x^2 + y^2.$$

Write a computer program to plot in the x-y plane the points $(x_i, y_j)$ for $x_i = 0.95 + (i - 1)/100$ for $i = 1, 2, \ldots, 11$ and $y_j = 1.95 + (j - 1)/100$ for $j = 1, 2, \ldots, 11$. Next transform all these points into the w-z plane using the given nonlinear transformation. What kind of figure do you see? Next calculate the area of the figure (you can use a rough approximation based on the computer generated figure output) and finally take the ratio of the areas of the figures in the two planes. Does this ratio agree with the Jacobian factor

$$\det\left(\frac{\partial(w, z)}{\partial(x, y)}\right)$$

when evaluated at $x = 1$, $y = 2$?
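
The area comparison can be sketched without a plotting library by tracing the perimeter of the grid and applying the shoelace formula in both planes. This is one possible approach under the stated grid, not a prescribed solution; the helper names are hypothetical, and the Jacobian determinant $4xy + 100xy = 104xy$ is worked out by hand from the given transformation.

```python
def transform(x, y):
    """The nonlinear map w = x^2 + 5y^2, z = -5x^2 + y^2."""
    return x * x + 5.0 * y * y, -5.0 * x * x + y * y

def shoelace_area(points):
    """Area of a simple polygon whose vertices are listed in order."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

xs = [0.95 + i / 100.0 for i in range(11)]
ys = [1.95 + j / 100.0 for j in range(11)]

# Walk once around the boundary of the grid: bottom, right, top, left.
perimeter = ([(x, ys[0]) for x in xs]
             + [(xs[-1], y) for y in ys[1:]]
             + [(x, ys[-1]) for x in reversed(xs[:-1])]
             + [(xs[0], y) for y in reversed(ys[1:-1])])

area_xy = shoelace_area(perimeter)
area_wz = shoelace_area([transform(x, y) for x, y in perimeter])
area_ratio = area_wz / area_xy
jacobian_det = 104.0 * 1.0 * 2.0   # det [[2x, 10y], [-10x, 2y]] at x = 1, y = 2
print(area_ratio, jacobian_det)
```

Using only boundary points is a design shortcut: interior grid points do not change the outline of the mapped figure, so they are not needed for the area.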


12.36  (^)  (f)  Find  the  marginal  PDFs  of  the  joint  PDF  given  in  (12.25). 

12.37  (f)  Determine  the  marginal  PDFs  for  the  joint  PDF  given  by 


X 

Y 
12.38 (0) (f) If X and Y have the joint PDF


find  the  joint  PDF  of  the  transformed  random  vector 


'  w  ■ 

'  1 

1  ' 

‘  X  ' 

z 

_  2 

3 

Y 

12.39 (t) Prove that the PDF of $Z = Y/X$, where $X$ and $Y$ are independent, is given by

$$p_Z(z) = \int_{-\infty}^{\infty} p_X(x)\, p_Y(xz)\, |x|\, dx.$$

12.40 (t) Prove that the PDF of $Z = XY$, where $X$ and $Y$ are independent, is given by

$$p_Z(z) = \int_{-\infty}^{\infty} p_X(x)\, p_Y(z/x)\, \frac{1}{|x|}\, dx.$$

12.41 (c) Generate outcomes of a Cauchy random variable using $Y/X$, where $X \sim \mathcal{N}(0, 1)$, $Y \sim \mathcal{N}(0, 1)$ and $X$ and $Y$ are independent. Can you explain what happens when the Cauchy outcome becomes very large in magnitude?


12.42 (t) Prove that $s(t) = A\cos(2\pi F_0 t) + B\sin(2\pi F_0 t)$ can be written as $s(t) = \sqrt{A^2 + B^2}\cos(2\pi F_0 t - \arctan(B/A))$. Hint: Convert $(A, B)$ to polar coordinates.


12.43 (^) (w) A particle is subject to a force in a random force field. If the velocity of the particle is modeled in the x and y directions as $V_x \sim \mathcal{N}(0, 10)$ and $V_y \sim \mathcal{N}(0, 10)$ meters/sec, and $V_x$ and $V_y$ are assumed to be independent, how far will the particle move on the average in 1 second?


12.44 (f) Prove that if $X$ and $Y$ are independent standard Gaussian random variables, then $X^2 + Y^2$ will have a $\chi_2^2$ PDF.


12.45 (o) (w,f) Two independent random variables $X$ and $Y$ have zero means and variances of 1. If they are linearly transformed as $W = X + Y$, $Z = X - Y$, find the covariance between the transformed random variables. Are $W$ and $Z$ uncorrelated? Are $W$ and $Z$ independent?


12.46(f)  If 


determine  the  mean  of  X  +  Y  and  the  variance  of  X  +  Y . 
12.47 (^) (w) The random vector $[X\ Y]^T$ has a covariance matrix


1 

2 


Find  a  2  x  2  matrix  G  so  that  G[X  Y]T  is  a  random  vector  with  uncorrelated 
components. 


12.48  (t)  Prove  that  if  a  random  vector  has  a  covariance  matrix 


then  the  matrix 


can  always  be  used  to  diagonalize  it.  Show  that  the  effect  of  this  matrix 
transformation  is  to  rotate  the  point  (x,  y)  by  45°  and  relate  this  back  to  the 
contours  of  a  standard  bivariate  Gaussian  PDF. 


12.49 (f) Find the MMSE estimator of $Y$ based on observing $X = x$ if $(X, Y)$ has the joint PDF

$$p_{X,Y}(x, y) = \frac{1}{2\pi\sqrt{0.19}} \exp\left[ -\frac{1}{2(0.19)} (x^2 - 1.8xy + y^2) \right].$$

Also, find the PDF of the error $Y - \hat{Y} = Y - (aX + b)$, where $a$, $b$ are the optimal values. Hint: See Theorem 12.7.1.


12.50 (w,c) A random signal voltage $V \sim \mathcal{N}(1, 1)$ is corrupted by an independent noise sample $N$, where $N \sim \mathcal{N}(0, 2)$, so that $V + N$ is observed. It is desired to estimate the signal voltage as accurately as possible using a linear MMSE estimator. Assuming that $V$ and $N$ are independent, find this estimator. Then plot the constant PDF contours for the random vector $(V + N, V)$ and indicate the estimated values on the plot.

12.51 (f) Using a convolution integral prove that if $X$ and $Y$ are independent standard Gaussian random variables, then $X + Y \sim \mathcal{N}(0, 2)$.

12.52  (0)  (f)  If 

'  X 
Y 


find  P[X  +  Y>  2]. 
12.53 (t) Prove that if $X \sim \Gamma(\alpha_X, \lambda)$ and $Y \sim \Gamma(\alpha_Y, \lambda)$ and $X$ and $Y$ are independent, then $X + Y \sim \Gamma(\alpha_X + \alpha_Y, \lambda)$.

12.54 (f) To generate two independent standard Gaussian random variables on a computer one can use the Box-Mueller transform

$$X = \sqrt{-2\ln U_1}\cos(2\pi U_2)$$
$$Y = \sqrt{-2\ln U_1}\sin(2\pi U_2)$$

where $U_1$, $U_2$ are both uniform on (0, 1) and independent of each other. Prove that this result is true. Hint: To find the inverse transformation use a polar coordinate transformation.
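
The transform itself is easy to try out. The sketch below is an empirical check only, not a proof: it generates pairs with the two formulas above and reports sample means, variances, and the sample covariance. The function names, sample size, and seed are arbitrary assumptions.

```python
import math
import random

def box_muller(rng):
    """Return one (X, Y) pair from two independent uniform draws."""
    u1 = 1.0 - rng.random()          # lies in (0, 1], which avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def sample_moments(n=100_000, seed=1):
    """Sample means, variances, and covariance of n Box-Muller pairs."""
    rng = random.Random(seed)
    sx = sy = sxx = syy = sxy = 0.0
    for _ in range(n):
        x, y = box_muller(rng)
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    mean_x, mean_y = sx / n, sy / n
    var_x = sxx / n - mean_x ** 2
    var_y = syy / n - mean_y ** 2
    cov_xy = sxy / n - mean_x * mean_y
    return mean_x, mean_y, var_x, var_y, cov_xy

mean_x, mean_y, var_x, var_y, cov_xy = sample_moments()
print(mean_x, mean_y, var_x, var_y, cov_xy)
```

Means near 0, variances near 1, and a covariance near 0 are consistent with two uncorrelated standard Gaussian outputs, although matching moments alone does not establish independence.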


12.55 (t) Prove that by using the eigendecomposition of a covariance matrix, or $V^T C V = \Lambda$, one can factor $C$ as $C = GG^T$, where $G = V\sqrt{\Lambda}$, and $\sqrt{\Lambda}$ is defined as the matrix obtained by taking the positive square roots of all the elements. Recall that $\Lambda$ is a diagonal matrix with positive elements on the main diagonal. Next find $G$ for the covariance matrix

$$C = \begin{bmatrix} 26 & 6 \\ 6 & 26 \end{bmatrix}$$

and verify that $GG^T$ does indeed produce $C$.
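
For a 2 x 2 symmetric matrix the factorization can be sketched with the standard library alone. The code below is an illustration under that restriction: it finds the eigenvectors with a single plane rotation (the closed-form angle for a symmetric 2 x 2 matrix) instead of a general eigensolver, and the names `factor_2x2` and `gram` are hypothetical.

```python
import math

def factor_2x2(C):
    """Return (G, eigenvalues) with G = V * sqrt(Lambda) for a symmetric 2x2 matrix C."""
    (a, b), (_, d) = C
    # Rotation angle that zeroes the off-diagonal term of V^T C V.
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    V = [[c, -s], [s, c]]                      # columns are the eigenvectors
    lam1 = a * c * c + 2.0 * b * c * s + d * s * s
    lam2 = a * s * s - 2.0 * b * c * s + d * c * c
    r1, r2 = math.sqrt(lam1), math.sqrt(lam2)  # requires positive eigenvalues
    G = [[V[0][0] * r1, V[0][1] * r2],
         [V[1][0] * r1, V[1][1] * r2]]
    return G, (lam1, lam2)

def gram(G):
    """Compute G G^T for a 2x2 matrix."""
    return [[sum(G[i][k] * G[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

C = [[26.0, 6.0], [6.0, 26.0]]
G, eigenvalues = factor_2x2(C)
print(G)
print(gram(G))
```

The factor $G$ is not unique (column order and signs of eigenvectors can differ), so the check that matters is that `gram(G)` reproduces `C`.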

12.56  (c)  Simulate  on  the  computer  realizations  of  the  random  vector 


Plot  these  realizations  as  well  as  the  contours  of  constant  PDF  on  the  same 
graph. 


Chapter  13 


Conditional  Probability  Density 
Functions 

13.1  Introduction 

A discussion of conditional probability mass functions (PMFs) was given in Chapter 8. The motivation was that many problems are stated in a conditional format so that the solution must naturally accommodate this conditional structure. In addition, conditioning is useful for simplifying probability calculations when two random variables are statistically dependent. In this chapter we formulate the analogous approach for probability density functions (PDFs). A potential stumbling block is that the usual conditioning event $X = x$ has probability zero for a continuous random variable. As a result the conditional PMF cannot be extended in a straightforward manner. We will see, however, that using care, a conditional PDF can be defined and will prove to be useful.


13.2  Summary 

The conditional PDF is defined in (13.3) and can be used to find conditional probabilities using (13.4). The conditional PDF for a standard bivariate Gaussian PDF is given by (13.5) and is seen to retain its Gaussian form. The joint, conditional, and marginal PDFs are related to each other as summarized by Properties 13.1-13.5. A conditional CDF is defined by (13.6) and is evaluated using (13.7). The use of conditioning can simplify probability calculations as described in Section 13.5. A version of the law of total probability is given by (13.12) and is evaluated using (13.13). An optimal predictor for the outcome of a random variable based on the outcome of a second random variable is given by the mean of the conditional PDF as defined by (13.14). An example is given for the bivariate Gaussian PDF in which the predictor becomes linear (actually affine). To generate realizations of two jointly distributed continuous random variables the procedure based on conditioning and described in Section 13.7 can be used. Lastly, an application to determining mortality rates for retirement planning is described in Section 13.8.


13.3  Conditional  PDF 


Recall that for two jointly discrete random variables $X$ and $Y$, the conditional PMF is defined as

$$p_{Y|X}[y_j|x_i] = \frac{p_{X,Y}[x_i, y_j]}{p_X[x_i]} \qquad j = 1, 2, \ldots \tag{13.1}$$

This formula gives the probability of the event $Y = y_j$ for $j = 1, 2, \ldots$ once we have observed that $X = x_i$. Since $X = x_i$ has occurred, the only joint events with a nonzero probability are $\{(x, y) : x = x_i, y = y_1, y_2, \ldots\}$. As a result we divide the joint probability $p_{X,Y}[x_i, y_j] = P[X = x_i, Y = y_j]$ by the probability of the reduced sample space, which is $p_X[x_i] = P[X = x_i, Y = y_1] + P[X = x_i, Y = y_2] + \cdots = \sum_{j=1}^{\infty} p_{X,Y}[x_i, y_j]$. This division assures us that

$$\sum_{j=1}^{\infty} p_{Y|X}[y_j|x_i] = \sum_{j=1}^{\infty} \frac{p_{X,Y}[x_i, y_j]}{p_X[x_i]} = \frac{\sum_{j=1}^{\infty} p_{X,Y}[x_i, y_j]}{p_X[x_i]} = \frac{\sum_{j=1}^{\infty} p_{X,Y}[x_i, y_j]}{\sum_{j=1}^{\infty} p_{X,Y}[x_i, y_j]} = 1.$$

In  the  case  of  continuous  random  variables  X  and  Y  a  problem  arises  in  defining 
a  conditional  PDF.  If  we  observe  X  =  x,  then  since  P[X  =  x]  =  0,  the  use  of  a 
formula  like  (13.1)  is  no  longer  valid  due  to  the  division  by  zero.  Recall  that  our 
original  definition  of  the  conditional  probability  is 


$$P[A|B] = \frac{P[A \cap B]}{P[B]}$$

which is undefined if $P[B] = 0$. How should we then proceed to extend (13.1) for continuous random variables?

We will motivate a viable approach using the example of the circular dartboard described in Section 12.3. In particular, we consider a revised version of the dart throwing contest. Referring to Figure 12.2 the champion dart player realizes that the novice presents little challenge. To make the game more interesting the champion proposes the following modification. If the novice player's dart lands outside the region $|x| \leq \Delta x/2$, then the novice player gets to go again. He continues until his dart lands within the region $|x| \leq \Delta x/2$ as shown cross-hatched in Figure 13.1a. The novice dart player even gets to pick the value of $\Delta x$. Hence, he reasons that it should be small to exclude regions of the dartboard that are outside the bullseye circle. As a result, he chooses a $\Delta x$ as shown in Figure 13.1b, which allows him to continue throwing darts until one lands within the cross-hatched region. The champion, however, has taken a course in probability and so is not worried. In fact, in Problem 12.30 the probability of the champion's dart landing in the bullseye area was shown to be 0.8646. To find the probability of the novice player obtaining a bullseye, we recall that his dart is equally likely to land anywhere on the dartboard. Hence, using conditional probability we have that

$$P[\text{bullseye} \mid -\Delta x/2 \leq X \leq \Delta x/2] = \frac{P[\text{bullseye}, -\Delta x/2 \leq X \leq \Delta x/2]}{P[-\Delta x/2 \leq X \leq \Delta x/2]}.$$

Figure 13.1: Revised dart throwing game. Only dart outcomes in the cross-hatched region are counted. (a) Dartboard. (b) Sample space (bullseye radius = 1/4).


Since $\Delta x/2$ is small, we can assume that it is much less than 1/4 as shown in Figure 13.1b. Therefore, we have that the cross-hatched regions can be approximated by rectangles and so

$$\begin{aligned}
P[\text{bullseye} \mid -\Delta x/2 \leq X \leq \Delta x/2] &= \frac{P[\text{double cross-hatched region}]}{P[\text{double cross-hatched region}] + P[\text{single cross-hatched region}]} \\
&= \frac{\Delta x (1/2)/\pi}{\Delta x (2)/\pi} \qquad \text{(probability = rectangle area/dartboard area)} \\
&= 0.25 < 0.865.
\end{aligned} \tag{13.2}$$

Hence, the revised strategy will still allow the champion to have a higher probability of winning for any $\Delta x$, no matter how small it is chosen. Even though $P[X = 0] = 0$,
the conditional probability is well defined even as $\Delta x \to 0$ (but not equal to 0). Some typical outcomes of this game are shown in Figure 13.2, where it is assumed the novice player has chosen $\Delta x/2 = 0.2$. In Figure 13.2a are shown the outcomes of $X$ for the novice player. Only those for which $|x| \leq \Delta x/2 = 0.2$, which are shown as the darker lines, are kept. In Figure 13.2b the outcomes of $Y$ are shown with the kept outcomes shown as the dark lines. Those outcomes that resulted in a bullseye are shown with a "b" over them. Note that there were 27 out of 100 outcomes that had $|x|$ values less than or equal to 0.2 (see Figure 13.2a), and of these, 8 outcomes resulted in a bullseye (see Figure 13.2b). Hence, the estimated probability of landing in either the single or double cross-hatched region of Figure 13.1b is 27/100 = 0.27, while the theoretical probability is approximately $\Delta x(2)/\pi = 0.4(2)/\pi = 0.255$. Also, the estimated conditional probability of a bullseye is, from Figure 13.2b, 8/27 = 0.30 while from (13.2) the theoretical probability is approximately equal to 0.25. (The approximations are due to the use of rectangular approximations to the cross-hatched regions, which only become exact as $\Delta x \to 0$.)

Figure 13.2: Revised dart throwing game outcomes. (a) All x outcomes versus trial number; those with $|x| \leq 0.2$ are shown as dark lines. (b) Dark lines are y outcomes for which $|x| \leq 0.2$; "b" indicates a bullseye ($\sqrt{x^2 + y^2} \leq 1/4$) for the outcomes with $|x| \leq 0.2$.

We will use the same strategy to define a conditional PDF. Let $A = \{(x, y) : \sqrt{x^2 + y^2} \leq 1/4\}$, which is the bullseye region. Then


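$$\begin{aligned}
P[A \mid |X| \leq \Delta x/2] &= \frac{P[A, |X| \leq \Delta x/2]}{P[|X| \leq \Delta x/2]} && \text{(definition of cond. prob.)} \\
&= \frac{P[\{(x, y) : |x| \leq \Delta x/2, |y| \leq \sqrt{1/16 - x^2}\}]}{P[\{(x, y) : |x| \leq \Delta x/2, |y| \leq 1\}]} && \left( \frac{\text{double cross-hatched area}}{\text{cross-hatched area}} \right) \\
&= \frac{P[\{(x, y) : |x| \leq \Delta x/2, |y| \leq \sqrt{1/16 - x^2}\}]}{P[\{x : |x| \leq \Delta x/2\}]} \\
&= \frac{\int_{-\Delta x/2}^{\Delta x/2} \int_{-\sqrt{1/16 - x^2}}^{\sqrt{1/16 - x^2}} p_{X,Y}(x, y)\, dy\, dx}{\int_{-\Delta x/2}^{\Delta x/2} p_X(x)\, dx}.
\end{aligned}$$

As $\Delta x \to 0$, we can write

$$\begin{aligned}
P[A \mid |X| \leq \Delta x/2] &= \frac{\int_{-\Delta x/2}^{\Delta x/2} \int_{-1/4}^{1/4} p_{X,Y}(x, y)\, dy\, dx}{\int_{-\Delta x/2}^{\Delta x/2} p_X(x)\, dx} && \text{(since } \sqrt{1/16 - x^2} \approx 1/4 \text{ for } |x| \leq \Delta x/2\text{)} \\
&= \frac{\int_{-1/4}^{1/4} p_{X,Y}(0, y)\, \Delta x\, dy}{p_X(0)\, \Delta x} && \text{(since } p_{X,Y}(x, y) \approx p_{X,Y}(0, y) \text{ for } |x| \leq \Delta x/2\text{)} \\
&= \int_{-1/4}^{1/4} \frac{p_{X,Y}(0, y)}{p_X(0)}\, dy.
\end{aligned}$$

We now define $p_{X,Y}/p_X$ as the conditional PDF

$$p_{Y|X}(y|x) = \frac{p_{X,Y}(x, y)}{p_X(x)}. \tag{13.3}$$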




Note that it is well defined as long as $p_X(x) \neq 0$. Thus, as $\Delta x \to 0$

$$P[A \mid |X| \leq \Delta x/2] = \int_{-1/4}^{1/4} p_{Y|X}(y|0)\, dy.$$

More generally, the conditional PDF allows us to compute probabilities as (see Problem 13.6)

$$P[a \leq Y \leq b \mid x - \Delta x/2 \leq X \leq x + \Delta x/2] = \int_a^b p_{Y|X}(y|x)\, dy.$$

This probability is usually written as

$$P[a \leq Y \leq b \mid X = x]$$

but the conditioning event should be understood to be $\{x : x - \Delta x/2 \leq X \leq x + \Delta x/2\}$ for $\Delta x$ small. Finally, with this understanding, we have that

$$P[a \leq Y \leq b \mid X = x] = \int_a^b p_{Y|X}(y|x)\, dy \tag{13.4}$$

where $p_{Y|X}$ is defined by (13.3) and is termed the conditional PDF. The conditional PDF $p_{Y|X}(y|x)$ is the probability per unit length of $Y$ when $X = x$ (actually $x - \Delta x/2 \leq X \leq x + \Delta x/2$) is observed. Since it is found using (13.3), it is seen to be a function of $y$ and $x$. It should be thought of as a family of PDFs with $y$ as the independent variable, and with a different PDF for each value $x$. An example follows.
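
As a numerical aside before the example, the revised dart game can be simulated to see the conditional probability settle near 0.25 as the strip narrows. This is a sketch under the assumptions used in the discussion above (novice dart uniform over a dartboard of radius 1, bullseye radius 1/4); the function name, seed, and number of throws are arbitrary choices.

```python
import random

def novice_game(half_width, throws=200_000, bullseye_radius=0.25, seed=3):
    """Return (fraction of throws kept, conditional bullseye frequency among kept)."""
    rng = random.Random(seed)
    kept = 0
    bullseyes = 0
    for _ in range(throws):
        # Draw a point uniformly on the unit-radius dartboard by rejection.
        while True:
            x = rng.uniform(-1.0, 1.0)
            y = rng.uniform(-1.0, 1.0)
            if x * x + y * y <= 1.0:
                break
        if abs(x) <= half_width:          # otherwise the novice "goes again"
            kept += 1
            if x * x + y * y <= bullseye_radius ** 2:
                bullseyes += 1
    return kept / throws, bullseyes / kept

p_strip, p_cond = novice_game(0.2)
_, p_cond_small = novice_game(0.02)
print(f"half-width 0.2:  P[strip] = {p_strip:.3f}, P[bullseye | strip] = {p_cond:.3f}")
print(f"half-width 0.02: P[bullseye | strip] = {p_cond_small:.3f}")
```

With a half-width of 0.2 the rectangular approximation is still rough, so the conditional frequency comes out somewhat below 0.25; with the narrower strip it moves closer to the limiting value, at the cost of keeping far fewer throws.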

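Example 13.1 - Standard bivariate Gaussian PDF

Assume that $(X, Y)$ have the joint PDF

$$p_{X,Y}(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left[ -\frac{1}{2(1 - \rho^2)} (x^2 - 2\rho xy + y^2) \right] \qquad -\infty < x < \infty,\ -\infty < y < \infty$$

and note that the marginal PDF for $X$ is given by

$$p_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2} x^2 \right].$$

The conditional PDF is found from (13.3) as

$$p_{Y|X}(y|x) = \frac{\dfrac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left[ -\dfrac{1}{2(1 - \rho^2)} (x^2 - 2\rho xy + y^2) \right]}{\dfrac{1}{\sqrt{2\pi}} \exp\left[ -\dfrac{1}{2} x^2 \right]} = \frac{1}{\sqrt{2\pi(1 - \rho^2)}} \exp\left( -\frac{1}{2} Q \right)$$

where

$$\begin{aligned}
Q &= \frac{x^2 - 2\rho xy + y^2}{1 - \rho^2} - \frac{(1 - \rho^2)x^2}{1 - \rho^2} \\
&= \frac{y^2 - 2\rho xy + \rho^2 x^2}{1 - \rho^2} \\
&= \frac{(y - \rho x)^2}{1 - \rho^2}.
\end{aligned}$$

As a result we have that the conditional PDF is

$$p_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi(1 - \rho^2)}} \exp\left[ -\frac{1}{2(1 - \rho^2)} (y - \rho x)^2 \right] \tag{13.5}$$
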


and is seen to be Gaussian. This result, although of great importance, is not true in general. The form of the PDF usually changes from $p_Y(y)$ to $p_{Y|X}(y|x)$. We will denote this conditional PDF in shorthand notation as $Y|(X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2)$. As expected, the conditional PDF depends on $x$, and in particular the mean of the conditional PDF is a function of $x$. It is a valid PDF in that for each $x$ value, it is nonnegative and integrates to 1 over $-\infty < y < \infty$. These properties are true in general. In effect, the conditional PDF depends on the outcome of $X$ so that we use a different PDF for each $X$ outcome. For example, if $\rho = 0.9$ and we observe $X = -1$, then to compute $P[-1 \leq Y \leq -0.8 \mid X = -1]$ and $P[-0.1 \leq Y \leq 0.1 \mid X = -1]$, we first observe from (13.5) that $Y|(X = -1) \sim \mathcal{N}(-0.9, 0.19)$. Then

$$P[-1 \leq Y \leq -0.8 \mid X = -1] = Q\left( \frac{-1 - (-0.9)}{\sqrt{0.19}} \right) - Q\left( \frac{-0.8 - (-0.9)}{\sqrt{0.19}} \right) = 0.1815$$

$$P[-0.1 \leq Y \leq 0.1 \mid X = -1] = Q\left( \frac{-0.1 - (-0.9)}{\sqrt{0.19}} \right) - Q\left( \frac{0.1 - (-0.9)}{\sqrt{0.19}} \right) = 0.0223.$$

Can you explain the difference between these values? (See Figure 13.3b where the dark lines indicate $y = 0$ and $y = -0.9$.) In Figure 13.3b the cross-section of the joint PDF is shown. Once the cross-section is normalized so that it integrates to one, it becomes the conditional PDF $p_{Y|X}(y|-1)$. This is easily verified since

$$p_{Y|X}(y|-1) = \frac{p_{X,Y}(-1, y)}{p_X(-1)} = \frac{p_{X,Y}(-1, y)}{\int_{-\infty}^{\infty} p_{X,Y}(-1, y)\, dy}.$$


0 



Figure  13.3:  Standard  bivariate  Gaussian  PDF  and  its  cross-section  at  x  =  —1.  The 
normalized  cross-section  is  the  conditional  PDF. 

13.4  Joint,  Conditional,  and  Marginal  PDFs 

The  relationships  between  the  joint,  conditional,  and  marginal  PMFs  as  described 
in  Section  8.4  also  hold  for  the  corresponding  PDFs.  Hence,  we  just  summarize  the 
properties  and  leave  the  proofs  to  the  reader  (see  Problem  13.11). 

Property 13.1 - Joint PDF yields conditional PDFs.

$$p_{Y|X}(y|x) = \frac{p_{X,Y}(x, y)}{\int_{-\infty}^{\infty} p_{X,Y}(x, y)\, dy}$$

$$p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{\int_{-\infty}^{\infty} p_{X,Y}(x, y)\, dx}$$

□

Property 13.2 - Conditional PDFs are related.

$$p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)\, p_X(x)}{p_Y(y)}$$

□

Property 13.3 - Conditional PDF is expressible using Bayes' rule.

$$p_{Y|X}(y|x) = \frac{p_{X|Y}(x|y)\, p_Y(y)}{\int_{-\infty}^{\infty} p_{X|Y}(x|y)\, p_Y(y)\, dy}$$

□

Property 13.4 - Conditional PDF and its corresponding marginal PDF yields the joint PDF

$$p_{X,Y}(x, y) = p_{Y|X}(y|x)\, p_X(x) = p_{X|Y}(x|y)\, p_Y(y)$$

□

Property 13.5 - Conditional PDF and its corresponding marginal PDF yields the other marginal PDF

$$p_Y(y) = \int_{-\infty}^{\infty} p_{Y|X}(y|x)\, p_X(x)\, dx$$

□

A conditional CDF can also be defined. Based on (13.4) we have upon letting $a = -\infty$ and $b = y$

$$P[Y \leq y \mid X = x] = \int_{-\infty}^{y} p_{Y|X}(t|x)\, dt.$$

As a result the conditional CDF is defined as

$$F_{Y|X}(y|x) = P[Y \leq y \mid X = x] \tag{13.6}$$

and is evaluated using

$$F_{Y|X}(y|x) = \int_{-\infty}^{y} p_{Y|X}(t|x)\, dt. \tag{13.7}$$

As an example, if $Y|(X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2)$ as was shown in Example 13.1, we have that

$$F_{Y|X}(y|x) = 1 - Q\left( \frac{y - \rho x}{\sqrt{1 - \rho^2}} \right). \tag{13.8}$$

Finally, as previously mentioned in Chapter 12 two continuous random variables $X$ and $Y$ are independent if and only if the joint PDF factors as $p_{X,Y}(x, y) = p_X(x)p_Y(y)$ or equivalently if the joint CDF factors as $F_{X,Y}(x, y) = F_X(x)F_Y(y)$. This is consistent with our definition of the conditional PDF since if $X$ and $Y$ are independent

$$p_{Y|X}(y|x) = \frac{p_{X,Y}(x, y)}{p_X(x)} = \frac{p_X(x)\, p_Y(y)}{p_X(x)} = p_Y(y) \tag{13.9}$$

(and similarly $p_{X|Y} = p_X$). Hence, the conditional PDF no longer depends on the observed value of $X$, i.e., $x$. This means that the knowledge that $X = x$ has occurred does not affect the PDF of $Y$ (and thus does not affect the probability of events defined on $\mathcal{S}_Y$). Similarly, from (13.7), if $X$ and $Y$ are independent

$$\begin{aligned}
F_{Y|X}(y|x) &= \int_{-\infty}^{y} p_{Y|X}(t|x)\, dt \\
&= \int_{-\infty}^{y} p_Y(t)\, dt && \text{(from (13.9))} \\
&= F_Y(y).
\end{aligned}$$

An example would be if $\rho = 0$ for the standard bivariate Gaussian PDF. Then since $Y|(X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2) = \mathcal{N}(0, 1)$, we have that $p_{Y|X}(y|x) = p_Y(y)$. Also, from (13.8)

$$F_{Y|X}(y|x) = 1 - Q\left( \frac{y - \rho x}{\sqrt{1 - \rho^2}} \right) = 1 - Q(y) = F_Y(y).$$

Another  example  follows. 

Example  13.2  -  Lifetime  PDF  of  spare  lightbulb 

A professor uses the overhead projector for his class. The time to failure of a new bulb $X$ has the exponential PDF $p_X(x) = \lambda\exp(-\lambda x)u(x)$, where $x$ is in hours. A new spare bulb also has a time to failure $Y$ that is modeled as an exponential PDF. However, the time to failure of the spare bulb depends upon how long the spare bulb sits unused. Assuming the spare bulb is activated as soon as the original bulb fails, the time to activation is given by $X$. As a result, the expected time to failure of the spare bulb is decreased as

$$\frac{1}{\lambda_Y} = \frac{1}{\lambda(1 + \alpha x)}$$

where $0 < \alpha < 1$ is some factor that indicates the degradation of the unused bulb with storage time. The expected time to failure of the spare bulb decreases as the original bulb is used longer (and hence the spare bulb must sit unused longer). Thus, we model the time to failure of the spare bulb as

$$\begin{aligned}
p_{Y|X}(y|x) &= \lambda_Y \exp(-\lambda_Y y)u(y) \\
&= \lambda(1 + \alpha x) \exp[-\lambda(1 + \alpha x)y]\, u(y).
\end{aligned}$$

Figure 13.4: Conditional PDF for lifetime of spare bulb. Dependence is on time to failure x of original bulb.

This conditional PDF is shown in Figure 13.4 for $1/\lambda = 5$ hours and $\alpha = 0.5$. We now wish to determine the unconditional PDF of the time to failure of the spare bulb, which is $p_Y(y)$. It is expected that the probability of an early failure of the spare bulb will be higher than if the spare bulb were used right away, i.e., for $x = 0$. Note that if $x = 0$, then $p_{Y|X} = p_X$, which says that the spare bulb will fail with the same PDF as the original bulb. Using Property 13.5 we have


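$$\begin{aligned}
p_Y(y) &= \int_{-\infty}^{\infty} p_{Y|X}(y|x)\, p_X(x)\, dx \\
&= \int_0^{\infty} \lambda(1 + \alpha x) \exp[-\lambda(1 + \alpha x)y]\, \lambda\exp(-\lambda x)\, dx \\
&= \lambda^2 \exp(-\lambda y) \int_0^{\infty} (1 + \alpha x) \exp[-\lambda(\alpha y + 1)x]\, dx \\
&= \lambda^2 \exp(-\lambda y) \left[ \int_0^{\infty} \exp(ax)\, dx + \alpha \int_0^{\infty} x\exp(ax)\, dx \right] && \text{(let } a = -\lambda(1 + \alpha y)\text{)} \\
&= \lambda^2 \exp(-\lambda y) \left[ \left. \frac{\exp(ax)}{a} \right|_0^{\infty} + \alpha \left( \frac{x}{a}\exp(ax) - \frac{1}{a^2}\exp(ax) \right) \Bigg|_0^{\infty} \right] \\
&= \lambda^2 \exp(-\lambda y) \left[ -\frac{1}{a} + \frac{\alpha}{a^2} \right] \\
&= \lambda^2 \exp(-\lambda y) \left[ \frac{1}{\lambda(\alpha y + 1)} + \frac{\alpha}{[\lambda(\alpha y + 1)]^2} \right]
\end{aligned}$$

or finally

$$p_Y(y) = \lambda^2 \exp(-\lambda y) \left[ \frac{1}{\lambda(\alpha y + 1)} + \frac{\alpha}{[\lambda(\alpha y + 1)]^2} \right] u(y).$$

This is shown in Figure 13.5 for $1/\lambda = 5$ hours and $\alpha = 0.5$ along with the PDF $p_X(x)$ of the time to failure of the original bulb. As expected the probability of the spare bulb failing before 2 hours is greatly increased.

Figure 13.5: PDFs for time to failure of original bulb X and spare bulb Y.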
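
The size of this increase can be checked numerically. The sketch below integrates the unconditional PDF derived above over (0, 2) hours and compares it with a simulation that first draws the original bulb's lifetime and then the spare's, using $1/\lambda = 5$ hours and $\alpha = 0.5$ as in the figures. The function names, the midpoint-rule step count, and the seed are assumptions made for the illustration.

```python
import math
import random

def p_y(y, lam, alpha):
    """Unconditional PDF of the spare bulb's time to failure."""
    if y < 0.0:
        return 0.0
    k = lam * (alpha * y + 1.0)
    return lam ** 2 * math.exp(-lam * y) * (1.0 / k + alpha / k ** 2)

def prob_fail_before(t, lam, alpha, steps=20_000):
    """P[Y <= t] by midpoint-rule integration of p_y over (0, t)."""
    h = t / steps
    return sum(p_y((i + 0.5) * h, lam, alpha) for i in range(steps)) * h

def simulate_fail_before(t, lam, alpha, trials=200_000, seed=2):
    """Estimate P[Y <= t] by drawing X, then Y given X = x."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        x = rng.expovariate(lam)                        # original bulb lifetime
        y = rng.expovariate(lam * (1.0 + alpha * x))    # spare bulb, rate grows with x
        if y <= t:
            count += 1
    return count / trials

lam, alpha = 1.0 / 5.0, 0.5
p_spare = prob_fail_before(2.0, lam, alpha)
p_spare_sim = simulate_fail_before(2.0, lam, alpha)
p_original = 1.0 - math.exp(-lam * 2.0)
print(f"spare: {p_spare:.4f} (integrated), {p_spare_sim:.4f} (simulated); original: {p_original:.4f}")
```

The simulation follows the same conditioning structure as the derivation: a marginal draw for $X$ followed by a conditional draw for $Y$ given $X = x$.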

Finally, note that the conditional PDF is obtained by differentiating the conditional CDF. From (13.7) we have

$$p_{Y|X}(y|x) = \frac{\partial F_{Y|X}(y|x)}{\partial y}. \tag{13.10}$$


13.5  Simplifying  Probability  Calculations 
Using  Conditioning 

Following Section 8.5 we can easily find the PDF of $Z = g(X, Y)$ if $X$ and $Y$ are independent by using conditioning. We shall not repeat the argument other than to summarize the results and give an example. The procedure is

1. Fix $X = x$ and let $Z|(X = x) = g(x, Y)$.

2. Find the PDF of $Z|(X = x)$ using the standard approach for a transformation of a single random variable from $Y$ to $Z$.

3. Uncondition the conditional PDF to yield the desired PDF $p_Z(z)$.
Example 13.3 - PDF for ratio of independent random variables

Consider the function $Z = Y/X$ where $X$ and $Y$ are independent random variables. In Problem 12.39 we asserted that

$$p_Z(z) = \int_{-\infty}^{\infty} p_X(x)\, p_Y(xz)\, |x|\, dx.$$

We now derive this using the aforementioned approach. First recall that if $Z = aY$ for $a$ a constant, then $p_Z(z) = p_Y(z/a)/|a|$ (see Example 10.5). Now we have that

$$Z|(X = x) = \left. \frac{Y}{X} \right| (X = x) = \left. \frac{Y}{x} \right| (X = x)$$

so that with $a = 1/x$ and noting the independence of $X$ and $Y$, we have

$$p_{Z|X}(z|x) = p_{Y|X}(zx|x)\, |x| = p_Y(zx)\, |x| \tag{13.11}$$

and thus

$$\begin{aligned}
p_Z(z) &= \int_{-\infty}^{\infty} p_{Z,X}(z, x)\, dx && \text{(marginal PDF from joint PDF)} \\
&= \int_{-\infty}^{\infty} p_{Z|X}(z|x)\, p_X(x)\, dx && \text{(definition of conditional PDF)} \\
&= \int_{-\infty}^{\infty} p_Y(zx)\, |x|\, p_X(x)\, dx && \text{(from (13.11))} \\
&= \int_{-\infty}^{\infty} p_X(x)\, p_Y(xz)\, |x|\, dx.
\end{aligned}$$

Note that without the independence assumption, we could not assert that $p_{Y|X} = p_Y$ in (13.11).

0 
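
The ratio formula can be tried on a case where the answer is known in closed form. For independent standard Gaussian $X$ and $Y$ the ratio $Y/X$ is Cauchy (compare Problem 12.41), with PDF $1/(\pi(1 + z^2))$. The sketch below evaluates the integral by a midpoint rule and compares it with that expression; the integration limits, step count, and function names are assumptions.

```python
import math

def gauss_pdf(t):
    """Standard Gaussian PDF."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def ratio_pdf(z, p_x=gauss_pdf, p_y=gauss_pdf, limit=10.0, steps=40_000):
    """Midpoint-rule evaluation of the integral of p_X(x) p_Y(xz) |x| over (-limit, limit)."""
    h = 2.0 * limit / steps
    total = 0.0
    for i in range(steps):
        x = -limit + (i + 0.5) * h
        total += p_x(x) * p_y(x * z) * abs(x)
    return total * h

def cauchy_pdf(z):
    """PDF of a standard Cauchy random variable."""
    return 1.0 / (math.pi * (1.0 + z * z))

for z in (0.0, 0.5, 2.0):
    print(z, ratio_pdf(z), cauchy_pdf(z))
```

Truncating the integral at plus or minus 10 is harmless here because the Gaussian factors are negligible beyond that range; other choices of `p_x` and `p_y` may need different limits.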

In  general  to  compute  probabilities  of  events  it  is  advantageous  to  use  conditioning 
arguments  whether  or  not  X  and  Y  are  independent.  The  analogous  result  to  (8.28) 
is  (see  Problem  13.15) 

\[ P[Y \in A] = \int_{-\infty}^{\infty} P[Y \in A|X = x]\, p_X(x)\, dx. \tag{13.12} \]


This  is  another  form  of  the  theorem  of  total  probability.  It  can  also  be  written  as 


\[ P[Y \in A] = \int_{-\infty}^{\infty} \left[ \int_A p_{Y|X}(y|x)\, dy \right] p_X(x)\, dx \tag{13.13} \]


where  we  have  used  (13.4)  and  replaced  {y  :  a  <  y  <  b}  by  the  more  general  set 
A.  The  formula  of  (13.13)  is  analogous  to  (8.27)  for  discrete  random  variables.  An 
example  follows. 

Example 13.4 - Probability of error for a digital communication system

Consider the PSK communication system shown in Figure 2.14. The probability of error was shown in Section 10.6 to be

\[ P_e = P[W \le -A/2] = Q(A/2) \]

since the noise sample \(W \sim \mathcal{N}(0,1)\). In a wireless communication system such as is used in cellular telephone, the received amplitude \(A\) varies with time due to multipath propagation [Rappaport 2002]. As a result, it is usually modeled as a Rayleigh random variable whose PDF is

\[ p_A(a) = \begin{cases} \dfrac{a}{\sigma_A^2} \exp\left( -\dfrac{1}{2\sigma_A^2} a^2 \right) & a \ge 0 \\ 0 & a < 0. \end{cases} \]

We wish to determine the probability of error if \(A\) is a Rayleigh random variable. Thus, we need to evaluate \(P[W + A/2 \le 0]\) if \(W \sim \mathcal{N}(0,1)\), \(A\) is a Rayleigh random variable, and we assume \(W\) and \(A\) are independent. A straightforward approach is to first find the PDF of \(Z = W + A/2\), and then to integrate \(p_Z(z)\) from \(-\infty\) to 0. Alternatively, it is simpler to use (13.12) as follows.

\[
\begin{aligned}
P_e &= P[W \le -A/2] \\
&= \int_{-\infty}^{\infty} P[W \le -A/2\,|\,A = a]\, p_A(a)\, da && \text{(from (13.12))} \\
&= \int_{-\infty}^{\infty} P[W \le -a/2\,|\,A = a]\, p_A(a)\, da && \text{(since } A = a \text{ has occurred).}
\end{aligned}
\]

But since \(W\) and \(A\) are independent, \(P[W \le -a/2\,|\,A = a] = P[W \le -a/2]\) and thus

\[ P_e = \int_{-\infty}^{\infty} P[W \le -a/2]\, p_A(a)\, da. \]

Using \(P[W \le -a/2] = Q(a/2)\) we have

\[ P_e = \int_0^{\infty} Q(a/2)\, \frac{a}{\sigma_A^2} \exp\left( -\frac{1}{2\sigma_A^2} a^2 \right) da. \]


Unfortunately,  this  is  not  easily  evaluated  in  closed  form. 
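
The integral is, however, simple to evaluate on a computer. The sketch below is one way to do so; it assumes a Rayleigh parameter value of 1 (the variable sigma2, an illustrative choice not taken from the text) and writes the Q function in terms of the built-in erfc. A Monte Carlo estimate of P[W + A/2 <= 0] is included as a cross-check, with the Rayleigh realizations generated by the inverse CDF method.

```matlab
% Sketch: numerical evaluation of Pe = int_0^inf Q(a/2) p_A(a) da.
% Assumption: sigma2 = 1 is an illustrative value of the Rayleigh parameter.
Q = @(u) 0.5*erfc(u/sqrt(2));                  % right-tail probability of N(0,1)
sigma2 = 1;
pA = @(a) (a/sigma2).*exp(-a.^2/(2*sigma2));   % Rayleigh PDF for a >= 0
Pe = integral(@(a) Q(a/2).*pA(a), 0, Inf);     % numerical integration

% Monte Carlo cross-check of P[W + A/2 <= 0]
randn('state',0); rand('state',0);             % legacy seeding, as in the text
M = 100000;                                    % number of realizations (arbitrary)
a = sqrt(-2*sigma2*log(rand(M,1)));            % Rayleigh outcomes via inverse CDF
w = randn(M,1);                                % N(0,1) noise, independent of a
Pe_mc = mean(w + a/2 <= 0);

disp([Pe Pe_mc])
```

Repeating the computation for several values of sigma2 gives the probability of error as a function of the Rayleigh parameter.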


❖ 


13.6  Mean  of  Conditional  PDF 

For  a  conditional  PDF  the  mean  is  given  by  the  usual  mean  definition  except  that 
the  PDF  now  depends  on  x.  We  therefore  have  the  definition 

\[ E_{Y|X}[Y|x] = \int_{-\infty}^{\infty} y\, p_{Y|X}(y|x)\, dy \tag{13.14} \]

which is analogous to (8.29) for discrete random variables. We also expect and it follows that (see Problem 13.19 and also the discussion in Section 8.6)

\[ E_X[E_{Y|X}[Y|X]] = E_Y[Y] \tag{13.15} \]

where \(E_{Y|X}[Y|X]\) is given by (13.14) except that the value \(x\) is now replaced by the random variable \(X\). Therefore, \(E_{Y|X}[Y|X]\) is viewed as a function of the random variable \(X\). As an example, we saw that for the bivariate Gaussian PDF \(Y|(X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2)\). Hence, \(E_{Y|X}[Y|x] = \rho x\), but regarding the mean of the conditional PDF as a function of the random variable \(X\) we have that \(E_{Y|X}[Y|X] = \rho X\). To see that (13.15) holds for this example

\[ E_X[E_{Y|X}[Y|X]] = E_X[\rho X] = \rho E_X[X] = 0 \]

since the marginal PDF of \(X\) for the standard bivariate Gaussian PDF was shown to be \(\mathcal{N}(0,1)\). Also, since \(Y \sim \mathcal{N}(0,1)\) for the standard bivariate Gaussian PDF, \(E_Y[Y] = 0\), and we see that (13.15) is satisfied.

The mean of the conditional PDF arises in optimal prediction, where it is proven that the minimum mean square error (MMSE) prediction of \(Y\), given that \(X = x\) has been observed, is \(E_{Y|X}[Y|x]\) (see Problem 13.17). This is optimal over all predictors, linear and nonlinear. For the standard bivariate Gaussian PDF, however, the optimal prediction turns out to be linear since \(E_{Y|X}[Y|x] = \rho x\). More generally, it can be shown that if \(X\) and \(Y\) are jointly Gaussian with PDF given by (12.35), then

\[ E_{Y|X}[Y|x] = E_Y[Y] + \frac{\mathrm{cov}(X,Y)}{\mathrm{var}(X)} (x - E_X[X]) = \mu_Y + \frac{\rho \sigma_X \sigma_Y}{\sigma_X^2} (x - \mu_X). \]

(See also Problem 13.20.)

13.7  Computer  Simulation  of  Jointly  Continuous
Random  Variables

In a manner similar to that described in Section 8.7 we can generate realizations of a continuous random vector \((X, Y)\) using the relationship

\[ p_{X,Y}(x,y) = p_{Y|X}(y|x)\, p_X(x). \]

(Of course, if \(X\) and \(Y\) are independent, we can generate \(X\) based on \(p_X(x)\) and \(Y\) based on \(p_Y(y)\).) Consider as an example the standard bivariate Gaussian PDF. We know that \(Y|(X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2)\) and \(X \sim \mathcal{N}(0,1)\). Hence, we can generate realizations of \((X, Y)\) as follows.

Step 1. Generate \(X = x\) according to \(\mathcal{N}(0,1)\).

Step 2. Generate \(Y|(X = x)\) according to \(\mathcal{N}(\rho x, 1 - \rho^2)\).


This procedure is conceptually simpler than what we implemented in Section 12.11 and much more general. There we used (12.53). Referring to (12.53), if we let \(\mu_W = \mu_Z = 0\), \(\sigma_W^2 = \sigma_Z^2 = 1\) and make the replacements of \(W, Z, X, Y\) with \(X, Y, U, V\), we have

\[ \begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \rho & \sqrt{1 - \rho^2} \end{bmatrix} \begin{bmatrix} U \\ V \end{bmatrix} \tag{13.16} \]

where \(U \sim \mathcal{N}(0,1)\), \(V \sim \mathcal{N}(0,1)\), and \(U\) and \(V\) are independent. The transformation of (13.16) can be used to generate realizations of a standard bivariate Gaussian random vector. It is interesting to note that in this special case the two procedures for generating bivariate Gaussian random vectors lead to the identical algorithm. Can you see why from (13.16)?

As an example of the conditional PDF approach, if we let \(\rho = 0.9\), we have the plot shown in Figure 13.6. It should be compared with Figure 12.21 (note that in Figure 12.21 the means of \(X\) and \(Y\) are 1).

Figure 13.6: 500 outcomes of standard bivariate Gaussian random vector with \(\rho = 0.9\) generated using conditional PDF approach.

The MATLAB code used to generate realizations of a standard bivariate Gaussian random vector using conditioning is given below.

randn('state',0) % set random number generator to initial value
rho=0.9;
M=500; % set number of realizations to generate
for m=1:M
   x(m,1)=randn(1,1); % generate realization of N(0,1) random
                      % variable (Step 1)
   ygx(m,1)=rho*x(m)+sqrt(1-rho^2)*randn(1,1); % generate
                      % Y|(X=x) (Step 2)
end
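
As a hedged illustration, the same two steps can be written without a loop and set beside the matrix form of (13.16). The variable names u, v, G, and xy below are choices made for this sketch and do not appear in the text. The sample correlation coefficient is also computed as a rough check that the realizations have a correlation near the chosen value of rho.

```matlab
% Sketch: vectorized form of Steps 1 and 2, compared with (13.16).
randn('state',0);                     % legacy seeding call, as used in the text
rho = 0.9;
M = 500;                              % number of realizations
u = randn(M,1);                       % U ~ N(0,1)
v = randn(M,1);                       % V ~ N(0,1), independent of U

x = u;                                % Step 1: X = x according to N(0,1)
ygx = rho*x + sqrt(1-rho^2)*v;        % Step 2: Y|(X=x) ~ N(rho*x, 1-rho^2)

G = [1 0; rho sqrt(1-rho^2)];         % matrix appearing in (13.16)
xy = (G*[u v]')';                     % each row of xy is one outcome [x y]

maxdiff = max(abs(xy(:,2) - ygx));    % difference between the two procedures
c = corrcoef(x, ygx);
rho_est = c(1,2);                     % sample correlation coefficient
disp([maxdiff rho_est])
```

The first displayed value is zero up to rounding error, since the second row of (13.16) is exactly the Step 2 computation.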


13.8  Real-World  Example  -  Retirement  Planning 

Professor  Staff,  who  teaches  too  many  courses  a  semester,  plans  to  retire  at  age  65. 
He  will  have  accumulated  a  total  of  $500,000  in  a  retirement  account  and  wishes  to 
use  the  money  to  live  on  during  his  retirement  years.  He  assumes  that  his  money 
will  earn  enough  to  offset  the  decrease  in  value  due  to  inflation.  Hence,  if  he  lives  to 
age  75  he  could  spend  $50,000  a  year  and  if  he  lives  to  age  85,  then  he  could  spend 
only  $25,000  a  year.  How  much  should  he  figure  on  spending  per  year? 

Besides the many courses Professor Staff has taught in history, English, mathematics, and computer science, he has also taught a course in probability. He therefore reasons that if he spends \(s\) dollars a year and lives for \(Y\) years during his retirement, then the probability that \(500{,}000 - sY < 0\) should be small. Here \(sY\) is the total money spent during his retirement. In other words, he desires

\[ P[500{,}000 - sY < 0] = 0.5. \tag{13.17} \]

He chooses 0.5 for the probability of outliving his retirement fund. This acknowledges the fact that choosing a lower probability will lead to an overly conservative approach and a small amount of expendable funds per year as we will see shortly. Equivalently, he requires that

\[ P\left[ Y > \frac{500{,}000}{s} \right] = 0.5. \tag{13.18} \]

As an example, if he spends \(s = 50{,}000\) per year, then the probability he lives more than \(500{,}000/s = 10\) years should be 0.5.

It  should  now  be  obvious  that  (13.18)  is  actually  the  right-tail  probability  or 
complementary  CDF  of  the  years  lived  in  retirement.  This  type  of  information  is  of 
great  interest  not  only  to  retirees  but  also  to  insurance  companies  who  pay  annuities. 
An  annuity  is  a  payment  that  an  insurance  company  pays  annually  to  an  investor  for 
the  remainder  of  his/her  life.  The  amount  of  the  payment  depends  upon  how  much 
the  investor  originally  invests,  the  age  of  the  investor,  and  the  insurance  company’s 
belief  that  the  investor  will  live  for  so  many  years.  To  quantify  answers  to  questions 
concerning years of life remaining, the mortality rate, which is the distribution of years lived past a given age, is required. If \(Y\) is a continuous random variable that denotes the years lived past age \(X = x\), then the mortality rate can be described by first defining the conditional CDF

\[ F_{Y|X}(y|x) = P[Y \le y\,|\,X = x]. \]




For example, the probability that a person will live at least 10 more years if he is currently 65 years old is given by

\[ P[Y > 10\,|\,X = 65] = 1 - F_{Y|X}(10|65) \]

which is the complementary CDF or the right-tail probability of the conditional PDF \(p_{Y|X}(y|x)\). It has been shown that for Canadian citizens the conditional CDF is well modeled by [Milevsky and Robinson 2000]

\[ F_{Y|X}(y|x) = 1 - \exp\left[ \exp\left( \frac{x - m}{l} \right) \left( 1 - \exp\left( \frac{y}{l} \right) \right) \right] \qquad y \ge 0 \tag{13.19} \]

where \(m = 81.95\), \(l = 10.6\) for males and \(m = 87.8\), \(l = 9.5\) for females. As an example, if \(F_{Y|X}(y|x) = 0.5\), then you have a 50% chance of living more than \(y\) years if you are currently \(x\) years old. In other words, 50% of the population who are \(x\) years old will live more than \(y\) years and 50% will live less than \(y\) years. The number of years \(y\) is the median number of years to live. (Recall that the median is the value at which the probability of being less than or equal to this value is 0.5.) From (13.19) this will be true when

\[ 0.5 = \exp\left[ \exp\left( \frac{x - m}{l} \right) \left( 1 - \exp\left( \frac{y}{l} \right) \right) \right] \]

which results in the remaining number of years lived by 50% of the population who are currently \(x\) years old as

\[ y = l \ln\left[ 1 - \exp\left( -\frac{x - m}{l} \right) \ln 0.5 \right]. \tag{13.20} \]


This  is  plotted  in  Figure  13.7a  versus  the  current  age  x  for  males  and  females.  In 
Figure  13.7a  the  median  number  of  years  left  is  shown  while  in  Figure  13.7b  the 
median  life  expectancy  (which  is  x  +  y)  is  given. 

Returning to Professor Staff, he can now determine how much money he can afford to spend each year. Since the probability of outliving one's retirement funds is a conditional probability based on current age, we rewrite (13.18) as

\[ P\left[ Y > \frac{500{,}000}{s}\,\Big|\,X = x \right] = P_L \]

where we allow the probability to be denoted in general by \(P_L\). Since he will retire at age \(x = 65\), we have from (13.19) that he will live more than \(y\) years with a probability of \(P_L\) given by

\[ P_L = \exp\left[ \exp\left( \frac{65 - m}{l} \right) \left( 1 - \exp\left( \frac{y}{l} \right) \right) \right]. \tag{13.21} \]

Figure 13.7: Mortality rates. (a) Years to live, \(y\). (b) Life expectancy, \(x + y\).

Figure 13.8: Probability \(P_L\) of exceeding \(y\) years in retirement for male who retires at age 65.


Assuming Professor Staff is a male, we use m = 81.95, l = 10.6 in (13.21) to produce a plot of P_L versus y as shown in Figure 13.8. If the professor is overly conservative, he may want to assure himself that the probability of outliving his retirement fund is only about 0.1. Then, he should plan on living another 27 years, which means that his yearly expenses should not exceed $500,000/27 ≈ $18,500. If he is less conservative and chooses a probability of 0.5, then he can plan on living about 15 years. Then his yearly expenses should not exceed $500,000/15 ≈ $33,000.
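
The planning horizon can also be computed directly instead of being read from a plot. The sketch below solves (13.21) for y using the same algebra that leads to (13.20), with ln 0.5 replaced by ln P_L; that generalization and the variable names (funds, PL, PL_check) are assumptions of this sketch rather than part of the text. Since the values come from the formula and not from the figure, they need not match the rounded numbers quoted above exactly.

```matlab
% Sketch: years to plan for, and yearly spending, for a chosen probability PL
% of outliving the retirement fund (male parameters from the text).
m = 81.95;  l = 10.6;                 % mortality model parameters for males
x = 65;                               % retirement age
funds = 500000;                       % size of the retirement account in dollars
PL = [0.5 0.1];                       % probabilities of outliving the fund

% solve (13.21) for y: same steps as (13.20) with ln(0.5) replaced by ln(PL)
y = l*log(1 - exp(-(x-m)/l)*log(PL));
s = funds./y;                         % largest yearly spending for each PL

% substitute back into (13.21) as a consistency check
PL_check = exp(exp((x-m)/l)*(1 - exp(y/l)));

disp([PL' y' s' PL_check'])
```

Each row of the displayed matrix lists the chosen probability, the corresponding number of years, the yearly spending limit, and the probability recovered from (13.21).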

Problems


13.1 (w,c) In this problem we simulate on a computer the dartboard outcomes of the novice player for the game shown in Figure 13.1a. To do so, generate two independent \(\mathcal{U}(-1,1)\) random variables to serve as the \(x\) and \(y\) outcomes. Keep only the outcomes \((x, y)\) for which \(\sqrt{x^2 + y^2} \le 1\) (see Problem 13.23 for why this produces a uniform joint PDF within the unit circle). Then, of the kept outcomes retain only the ones for which \(\Delta x/2 \le 0.2\) (see Figure 13.2a). Finally, estimate the probability that the novice player obtains a bullseye and compare it to the theoretical value. Note that the theoretical value of 0.25 as given by (13.2) is actually an approximation based on the areas in Figure 13.1b being rectangular.

13.2 (^) (w) Determine if the proposed conditional PDF

\[ p_{Y|X}(y|x) = \begin{cases} c \exp(-y/x) & y \ge 0,\ x > 0 \\ 0 & \text{otherwise} \end{cases} \]

is a valid conditional PDF for some \(c\). If so, find the required value of \(c\).


13.3 (w) Is the proposed conditional PDF

\[ p_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2} (y - x)^2 \right] \qquad -\infty < y < \infty,\ -\infty < x < \infty \]

valid? If so, and if \(X \sim \mathcal{N}(0,1)\), design an experiment that will produce the random variables \(X\) and \(Y\).


13.4 (0) (f) If

\[ p_{X,Y}(x,y) = \begin{cases} 2\exp[-(x + y)] & 0 \le y \le x,\ x \ge 0 \\ 0 & \text{otherwise} \end{cases} \]

find \(p_{Y|X}(y|x)\).

13.5 (w) Plot the joint PDF

\[ p_{X,Y}(x,y) = \begin{cases} 2x & 0 < x < 1,\ 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases} \]

Next determine by inspection the conditional PDF \(p_{Y|X}(y|x)\). Recall that the conditional PDF is just the normalized cross-section of the joint PDF.

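13.6 (t) In this problem we show that

\[ \lim_{\Delta x \to 0} P[a \le Y \le b\,|\,x - \Delta x/2 \le X \le x + \Delta x/2] = \int_a^b p_{Y|X}(y|x)\, dy. \]

To do so first show that

\[ \lim_{\Delta x \to 0} P[a \le Y \le b\,|\,x - \Delta x/2 \le X \le x + \Delta x/2] = \lim_{\Delta x \to 0} \frac{\int_a^b \int_{x - \Delta x/2}^{x + \Delta x/2} p_{X,Y}(x,y)\, dx\, dy}{\int_{x - \Delta x/2}^{x + \Delta x/2} p_X(x)\, dx}. \]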

13.7  (f)  Determine  P[Y  >  ^|X  =  0]  if  the  joint  PDF  is  given  as 


\[ p_{X,Y}(x,y) = \begin{cases} 2x & 0 < x < 1,\ 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases} \]


13.8 (0) (f) If \(X \sim \mathcal{U}(0,1)\) and \(Y|(X = x) \sim \mathcal{U}(0,x)\), find the joint PDF for \(X\) and \(Y\) and also the marginal PDF for \(Y\).

13.9 (f,t) For the standard bivariate Gaussian PDF find the conditional PDFs \(p_{Y|X}\) and \(p_{X|Y}\) and compare them. Explain your results. Are your results true in general?

13.10 (^) (f) If the joint PDF \(p_{X,Y}\) is uniform over the region \(0 < y < x\) and \(0 < x < 1\) and zero otherwise, find the conditional PDFs \(p_{Y|X}\) and \(p_{X|Y}\).

13.11 (t) Prove Properties P13.1-13.5.

13.12 (f) Determine the PDF of \(Y/X\) if \(X \sim \mathcal{N}(0,1)\), \(Y \sim \mathcal{N}(0,1)\) and \(X\) and \(Y\) are independent. Do so by using the conditioning approach.

13.13 (t) Prove that the PDF of \(X + Y\), where \(X\) and \(Y\) are independent, is given as a convolution integral (see (12.14)). Do so by using the conditioning approach.

13.14 (^) (w) A game of darts is played using the linear dartboard shown in Figure 3.8. If two novice players throw darts at the board and each one's dart is equally likely to land anywhere in the interval \((-1/2, 1/2)\), prove that the probability of player 2 winning is 1/2. Hint: Let \(X_1\) and \(X_2\) be the outcomes and use \(Y = |X_2| - |X_1|\) and \(X = X_1\) in (13.12).

13.15  (t)  Prove  (13.12)  by  starting  with  (13.4). 

13.16 (^) (w) A resistor is chosen from a bin of 10 ohm resistors whose distribution satisfies \(R \sim \mathcal{N}(10, 0.25)\). An \(i = 1\) amp current source is applied to the resistor and the subsequent voltage \(V\) is measured with a voltmeter. The voltmeter has an error \(E\) that is modeled as \(E \sim \mathcal{N}(0,1)\). Find the probability that \(V > 10\) volts if an 11 ohm resistor is chosen. Note that \(V = iR + E\). What assumption do you need to make about the dependence between \(R\) and \(E\)?

13.17 (t) In this problem we prove that the minimum mean square error estimate of \(Y\) based on \(X = x\) is given by \(E_{Y|X}[Y|x]\). First let the estimate be denoted by \(\hat{Y}(x)\) since it will depend in general on the outcome of \(X\). Then note that the mean square error is

\[
\begin{aligned}
\text{mse} &= E_{X,Y}[(Y - \hat{Y}(X))^2] \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (y - \hat{Y}(x))^2\, p_{X,Y}(x,y)\, dx\, dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (y - \hat{Y}(x))^2\, p_{Y|X}(y|x)\, p_X(x)\, dx\, dy \\
&= \int_{-\infty}^{\infty} \underbrace{\left[ \int_{-\infty}^{\infty} (y - \hat{Y}(x))^2\, p_{Y|X}(y|x)\, dy \right]}_{J(\hat{Y}(x))} p_X(x)\, dx.
\end{aligned}
\]

Now we can minimize \(J(\hat{Y}(x))\) for each value of \(x\) since \(p_X(x) \ge 0\). Complete the derivation by differentiating \(J(\hat{Y}(x))\) and setting the result equal to zero. Consider \(\hat{Y}(x)\) as a constant (since \(x\) is assumed fixed inside the inner integral) in doing so. Finally justify all the steps in the derivation.


13.18 (f) For the joint PDF given in Problem 13.10 find the minimum mean square error estimate of \(Y\) given \(X = x\). Plot the region in the x-y plane for which the joint PDF is nonzero and also the estimated value of \(Y\) versus \(x\).


13.19 (t) Prove (13.15).

13.20 (w,c) If a bivariate Gaussian PDF has a mean vector \([\mu_X\ \mu_Y]^T = [1\ 2]^T\) and a covariance matrix

\[ \mathbf{C} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \]

plot the contours of constant PDF. Next find the minimum mean square error prediction of \(Y\) given \(X = x\) and plot it on top of the contour plot. Explain the significance of the plot.

13.21 (^) (w) A random variable \(X\) has a Laplacian PDF with variance \(\sigma^2\). If the variance is chosen according to \(\sigma^2 \sim \mathcal{U}(0,1)\), what is the average variance of the random variable?

13.22 (c) In this problem we use a computer simulation to illustrate the known result that \(E_{Y|X}[Y|x] = \rho x\) for \((X, Y)\) distributed according to a standard bivariate Gaussian PDF. Using (13.16) generate \(M = 10{,}000\) realizations of a standard bivariate Gaussian random vector with \(\rho = 0.9\). Then let \(A = \{x : x_0 - \Delta x/2 \le x \le x_0 + \Delta x/2\}\) and discard the realizations for which \(x\) is not in \(A\). Finally, estimate the mean of the conditional PDF by taking the sample mean of the remaining realizations. Choose \(\Delta x/2 = 0.1\) and \(x_0 = 1\) and compare the theoretical value of \(E_{Y|X}[Y|x]\) to the estimated value based on your computer simulation.

13.23 (t) We now prove that the procedure described in Problem 13.1 will produce a random vector \((X, Y)\) that is uniformly distributed within the unit circle. First consider the polar equivalent of \((X, Y)\), which is \((R, \Theta)\), so that the conditional CDF is given by

\[ P[R \le r, \Theta \le \theta\,|\,R \le 1] \qquad 0 \le r \le 1,\ 0 \le \theta < 2\pi. \]

But this is equal to

\[ \frac{P[R \le r, R \le 1, \Theta \le \theta]}{P[R \le 1]} = \frac{P[R \le r, \Theta \le \theta]}{P[R \le 1]}. \]

(Why?) Next show that

\[ P[R \le r, \Theta \le \theta\,|\,R \le 1] = \frac{\theta r^2}{2\pi} \]

and differentiate with respect to \(r\) and then \(\theta\) to find the joint PDF \(p_{R,\Theta}(r, \theta)\) (which is actually a conditional joint PDF due to the conditioning on the value of \(R\) being \(r \le 1\)). Finally, transform this PDF back to that of \((X, Y)\) to verify that it is uniform within the unit circle. Hint: You will need the result


13.24  (0)  (f,c)  For  the  conditional  CDF  of  years  left  to  live  given  current  age, 
which  is  given  by  (13.19),  find  the  conditional  PDF.  Plot  the  conditional  PDF 
for  a  Canadian  male  who  is  currently  50  years  old  and  also  for  one  who  is  75 
years  old.  Next  find  the  average  life  span  for  each  of  these  individuals.  Hint: 
You  will  need  to  use  a  computer  evaluation  of  the  integral  for  the  last  part. 

13.25  (t)  Verify  that  the  conditional  CDF  given  by  (13.19)  is  a  valid  CDF. 


Chapter  14 


Continuous  N-Dimensional
Random  Variables

14.1  Introduction

This chapter extends the results of Chapters 10-13 for one and two continuous random variables to \(N\) continuous random variables. Our discussion will mirror Chapter 9 quite closely, the difference being the consideration of continuous rather than discrete random variables. Therefore, the descriptions will be brief and will serve mainly to extend the usual definitions for one and two jointly distributed continuous random variables to an \(N\)-dimensional random vector. One new concept that is introduced is the orthogonality principle approach to prediction of the outcome of a random variable based on the outcomes of several other random variables. This concept will be useful later when we discuss prediction of random processes in Chapter 18.


14.2  Summary 

The probability of an event defined on an \(N\)-dimensional sample space is given by (14.1). The most important example of an \(N\)-dimensional PDF is the multivariate Gaussian PDF, which is given by (14.2). If the components of the multivariate Gaussian random vector are uncorrelated, then they are also independent as shown in Example 14.2. Transformations of random vectors yield the transformed PDF given by (14.5). In particular, linear transformations of Gaussian random vectors preserve the Gaussian nature but change the mean vector and covariance matrix as discussed in Example 14.3. Expected values are described in Section 14.5 with the mean and variance of a linear combination of random variables given by (14.8) and (14.10), respectively. The sample mean random variable is introduced in Example 14.4. The joint moment is defined by (14.13) and the joint characteristic function by (14.15). Joint moments can be found from the characteristic function using (14.17). The PDF for a sum of independent and identically distributed random variables is conveniently determined using (14.22). The prediction of the outcome of a random variable based on a linear combination of the outcomes of other random variables is given by (14.24). The linear prediction coefficients are found by solving the set of simultaneous linear equations in (14.27). The orthogonality principle is summarized by (14.29) and illustrated in Figure 14.3. Section 14.9 describes the computer generation of a multivariate Gaussian random vector. Finally, Section 14.10 applies the results of this chapter to the real-world problem of signal detection with the optimal detector given by (14.33).

14.3  Random  Vectors  and  PDFs 

An \(N\)-dimensional random vector will be denoted by either \((X_1, X_2, \ldots, X_N)\) or \(\mathbf{X} = [X_1\ X_2\ \ldots\ X_N]^T\). It is defined as a mapping from the original sample space of the experiment to a numerical sample space \(S_{X_1,X_2,\ldots,X_N} = R^N\). Hence, \(\mathbf{X}\) takes on values in the \(N\)-dimensional Euclidean space \(R^N\) so that

\[ \mathbf{X}(s) = \begin{bmatrix} X_1(s) \\ X_2(s) \\ \vdots \\ X_N(s) \end{bmatrix} \]

will have values

\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \]

where \(\mathbf{x}\) is a point in \(R^N\). The number of possible values is uncountably infinite. As an example, we might observe the temperature on each of \(N\) successive days. Then, the elements of the random vector would be \(X_1(s)\) = temperature on day 1, \(X_2(s)\) = temperature on day 2, ..., \(X_N(s)\) = temperature on day \(N\), and each temperature measurement would take on a continuum of values.

To compute probabilities of events defined on \(S_{X_1,X_2,\ldots,X_N}\) we will define the \(N\)-dimensional joint PDF (or more succinctly just the PDF) as

\[ p_{X_1,X_2,\ldots,X_N}(x_1, x_2, \ldots, x_N) \]

and sometimes use the more compact notation \(p_{\mathbf{X}}(\mathbf{x})\). The usual properties of a joint PDF must be valid

\[ p_{X_1,X_2,\ldots,X_N}(x_1, x_2, \ldots, x_N) \ge 0 \]

\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p_{X_1,X_2,\ldots,X_N}(x_1, x_2, \ldots, x_N)\, dx_1\, dx_2 \ldots dx_N = 1. \]
-oo  J — oo  J  — oo 

Then the probability of an event \(A\) defined on \(R^N\) is given by

\[ P[A] = \int \int \cdots \int_A p_{X_1,X_2,\ldots,X_N}(x_1, x_2, \ldots, x_N)\, dx_1\, dx_2 \ldots dx_N. \tag{14.1} \]

The most important example of an \(N\)-dimensional joint PDF is the multivariate Gaussian PDF. This PDF is the extension of the bivariate Gaussian PDF described at length in Chapter 12 (see (12.35)). It is given in vector/matrix form as

\[ p_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{N/2} \det^{1/2}(\mathbf{C})} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{C}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right] \tag{14.2} \]

where \(\boldsymbol{\mu} = [\mu_1\ \mu_2\ \ldots\ \mu_N]^T\) is the \(N \times 1\) mean vector so that

\[ \boldsymbol{\mu} = \begin{bmatrix} E_{X_1}[X_1] \\ E_{X_2}[X_2] \\ \vdots \\ E_{X_N}[X_N] \end{bmatrix} \]

and \(\mathbf{C}\) is the \(N \times N\) covariance matrix defined as

\[ \mathbf{C} = \begin{bmatrix} \mathrm{var}(X_1) & \mathrm{cov}(X_1,X_2) & \ldots & \mathrm{cov}(X_1,X_N) \\ \mathrm{cov}(X_2,X_1) & \mathrm{var}(X_2) & \ldots & \mathrm{cov}(X_2,X_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(X_N,X_1) & \mathrm{cov}(X_N,X_2) & \ldots & \mathrm{var}(X_N) \end{bmatrix}. \]

Note that \(\mathbf{C}\) is assumed to be positive definite and so it is invertible and has \(\det(\mathbf{C}) > 0\) (see Appendix C). If the random variables have the multivariate Gaussian PDF, they are said to be jointly Gaussian distributed. Note that the covariance matrix can also be written as (see Problem 9.21)

\[ \mathbf{C} = E_{\mathbf{X}}\left[ (\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T \right]. \]

To denote a multivariate Gaussian PDF we will use the notation \(\mathcal{N}(\boldsymbol{\mu}, \mathbf{C})\). Clearly, for \(N = 2\) we have the bivariate Gaussian PDF. Evaluation of the probability of an event using (14.1) is in general quite difficult. Progress can, however, be made when \(A\) is a simple geometric region in \(R^N\) and \(\mathbf{C}\) is a diagonal matrix. An example follows.

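Example 14.1 - Probability of a point lying within a sphere

Assume \(N = 3\) and let \(\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\). We will determine the probability that an outcome falls within a sphere of radius \(R\). The event is then given by \(A = \{(x_1, x_2, x_3) : x_1^2 + x_2^2 + x_3^2 \le R^2\}\). This event might represent the probability that a particle with mass \(m\) and random velocity components \(V_x, V_y, V_z\) has a kinetic energy \(\mathcal{E} = (1/2) m (V_x^2 + V_y^2 + V_z^2)\) less than a given amount. This modeling is used in the kinetic theory of gases [Resnick and Halliday 1966] and is known as the Maxwellian distribution. From (14.2) we have with \(\boldsymbol{\mu} = \mathbf{0}\), \(\mathbf{C} = \sigma^2 \mathbf{I}\), and \(N = 3\)

\[
\begin{aligned}
P[A] &= \int\!\!\int\!\!\int_A \frac{1}{(2\pi)^{3/2} \det^{1/2}(\sigma^2 \mathbf{I})} \exp\left[ -\frac{1}{2} \mathbf{x}^T (\sigma^2 \mathbf{I})^{-1} \mathbf{x} \right] dx_1\, dx_2\, dx_3 \\
&= \int\!\!\int\!\!\int_A \frac{1}{(2\pi\sigma^2)^{3/2}} \exp\left[ -\frac{1}{2\sigma^2} (x_1^2 + x_2^2 + x_3^2) \right] dx_1\, dx_2\, dx_3
\end{aligned}
\]

since \(\det(\sigma^2 \mathbf{I}) = (\sigma^2)^3\) and \((\sigma^2 \mathbf{I})^{-1} = (1/\sigma^2)\mathbf{I}\). Next we notice that the region of integration is the inside of a sphere. As a result of this and the observation that the integrand only depends on the squared-distance of the point from the origin, a reasonable approach is to convert the Cartesian coordinates to spherical coordinates. Doing so produces the inverse transformation

\[
\begin{aligned}
x_1 &= r \cos\theta \sin\phi \\
x_2 &= r \sin\theta \sin\phi \\
x_3 &= r \cos\phi
\end{aligned}
\]

where \(r \ge 0\), \(0 \le \theta < 2\pi\), \(0 \le \phi \le \pi\). We must be sure to include in the integral over \(r, \theta, \phi\) the absolute value of the Jacobian determinant of the inverse transformation, which is \(r^2 \sin\phi\) (see Problem 14.5). Thus,

\[
\begin{aligned}
P[A] &= \int_0^R \int_0^{\pi} \int_0^{2\pi} \frac{1}{(2\pi\sigma^2)^{3/2}} \exp\left( -\frac{1}{2\sigma^2} r^2 \right) r^2 \sin\phi\, d\theta\, d\phi\, dr \\
&= \int_0^R \int_0^{\pi} \frac{1}{(2\pi\sigma^2)^{3/2}} \exp\left( -\frac{1}{2\sigma^2} r^2 \right) r^2\, 2\pi \sin\phi\, d\phi\, dr \\
&= \int_0^R \frac{1}{(2\pi\sigma^2)^{3/2}} \exp\left( -\frac{1}{2\sigma^2} r^2 \right) r^2\, 2\pi\, (-\cos\phi)\Big|_0^{\pi}\, dr \\
&= \int_0^R \frac{4\pi}{(2\pi\sigma^2)^{3/2}}\, r^2 \exp\left( -\frac{1}{2\sigma^2} r^2 \right) dr \\
&= \sqrt{\frac{2}{\pi\sigma^2}} \int_0^R \frac{r^2}{\sigma^2} \exp\left( -\frac{1}{2\sigma^2} r^2 \right) dr.
\end{aligned}
\]

To evaluate the integral

\[ \int_0^R \frac{r^2}{\sigma^2} \exp\left( -\frac{1}{2\sigma^2} r^2 \right) dr \]

we use integration by parts (see Problem 11.7) with \(U = r\) and hence \(dU = dr\) and \(dV = (r/\sigma^2) \exp[-r^2/(2\sigma^2)]\, dr\) so that \(V = -\exp[-r^2/(2\sigma^2)]\). Then

\[
\begin{aligned}
\int_0^R \frac{r^2}{\sigma^2} \exp\left( -\frac{1}{2\sigma^2} r^2 \right) dr &= -r \exp\left( -\frac{1}{2} r^2/\sigma^2 \right)\Big|_0^R + \int_0^R \exp\left( -\frac{1}{2} r^2/\sigma^2 \right) dr \\
&= -R \exp\left( -\frac{1}{2} R^2/\sigma^2 \right) + \sqrt{2\pi\sigma^2} \int_0^R \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} r^2/\sigma^2 \right) dr \\
&= -R \exp\left( -\frac{1}{2} R^2/\sigma^2 \right) + \sqrt{2\pi\sigma^2}\, [Q(0) - Q(R/\sigma)].
\end{aligned}
\]

Finally, we have that

\[
\begin{aligned}
P[A] &= \sqrt{\frac{2}{\pi\sigma^2}} \left[ -R \exp\left( -\frac{1}{2} R^2/\sigma^2 \right) + \sqrt{2\pi\sigma^2}\, [Q(0) - Q(R/\sigma)] \right] \\
&= 1 - 2Q(R/\sigma) - \sqrt{\frac{2}{\pi\sigma^2}}\, R \exp\left( -\frac{1}{2} R^2/\sigma^2 \right).
\end{aligned}
\]

◊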
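
The final expression of this example can be compared against a simple Monte Carlo estimate. The sketch below is an illustration only: the values sigma = 2, R = 3, and M = 100000 are arbitrary assumptions, and the Q function is written using the built-in erfc.

```matlab
% Sketch: Monte Carlo check of the probability that X ~ N(0, sigma^2 I), N = 3,
% falls within a sphere of radius R. The values of sigma, R, M are illustrative.
randn('state',0);                     % legacy seeding call, as used in the text
Q = @(u) 0.5*erfc(u/sqrt(2));         % right-tail probability of N(0,1)
sigma = 2;
R = 3;
M = 100000;                           % number of realizations

x = sigma*randn(M,3);                 % each row is one outcome (x1,x2,x3)
P_mc = mean(sum(x.^2,2) <= R^2);      % fraction of outcomes inside the sphere

% closed-form result derived in the example
P_formula = 1 - 2*Q(R/sigma) - sqrt(2/(pi*sigma^2))*R*exp(-R^2/(2*sigma^2));

disp([P_mc P_formula])
```

The two values should differ only by the statistical error of the estimate, which decreases as M is increased.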

The marginal PDFs are found by integrating out the other variables. For example, if \(p_{X_1}(x_1)\) is desired, then

\[ p_{X_1}(x_1) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p_{X_1,X_2,\ldots,X_N}(x_1, x_2, \ldots, x_N)\, dx_2\, dx_3 \ldots dx_N. \]

As an example, for the multivariate Gaussian PDF it can be shown that \(X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)\), where \(\sigma_i^2 = \mathrm{var}(X_i)\) (see Problem 14.16). Also, the lower dimensional joint PDFs are similarly found. To determine \(p_{X_1,X_N}(x_1, x_N)\) for example, we use

\[ p_{X_1,X_N}(x_1, x_N) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p_{X_1,X_2,\ldots,X_N}(x_1, x_2, \ldots, x_N)\, dx_2\, dx_3 \ldots dx_{N-1}. \]

The random variables are defined to be independent if the joint PDF factors into the product of the marginal PDFs as

\[ p_{X_1,X_2,\ldots,X_N}(x_1, x_2, \ldots, x_N) = p_{X_1}(x_1)\, p_{X_2}(x_2) \cdots p_{X_N}(x_N). \tag{14.3} \]

An example follows.

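Example 14.2 - Condition for independence of multivariate Gaussian random variables

If the covariance matrix for a multivariate Gaussian PDF is diagonal, then the random variables are not only uncorrelated but also independent as we now show. Assume that

\[ \mathbf{C} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2) \]

then it follows that

\[ \det(\mathbf{C}) = \prod_{i=1}^{N} \sigma_i^2 \qquad \mathbf{C}^{-1} = \mathrm{diag}\left( \frac{1}{\sigma_1^2}, \frac{1}{\sigma_2^2}, \ldots, \frac{1}{\sigma_N^2} \right). \]

Using these results in (14.2) produces

\[
\begin{aligned}
p_{\mathbf{X}}(\mathbf{x}) &= \frac{1}{(2\pi)^{N/2} \left( \prod_{i=1}^{N} \sigma_i^2 \right)^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \mathrm{diag}\left( \frac{1}{\sigma_1^2}, \frac{1}{\sigma_2^2}, \ldots, \frac{1}{\sigma_N^2} \right) (\mathbf{x} - \boldsymbol{\mu}) \right] \\
&= \frac{1}{\prod_{i=1}^{N} \sqrt{2\pi\sigma_i^2}} \exp\left[ -\frac{1}{2} \sum_{i=1}^{N} (x_i - \mu_i)^2/\sigma_i^2 \right] \\
&= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left[ -\frac{1}{2\sigma_i^2} (x_i - \mu_i)^2 \right] \\
&= \prod_{i=1}^{N} p_{X_i}(x_i)
\end{aligned}
\]

where \(X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)\). Hence, if a random vector has a multivariate Gaussian PDF and the covariance matrix is diagonal, which means that the random variables are uncorrelated, then the random variables are also independent.

◊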


⚠ Uncorrelated implies independence only for multivariate Gaussian PDF even if marginal PDFs are Gaussian!

Consider the counterexample of a PDF for the random vector $(X, Y)$ given by

$$p_{X,Y}(x,y) = \frac{1}{2}\,\frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(x^2 - 2\rho xy + y^2\right)\right] + \frac{1}{2}\,\frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(x^2 + 2\rho xy + y^2\right)\right] \qquad (14.4)$$

for $0 < \rho < 1$. This PDF is shown in Figure 14.1 for $\rho = 0.9$. Clearly, the random variables are not independent. Yet, it can be shown that $X \sim \mathcal{N}(0,1)$, $Y \sim \mathcal{N}(0,1)$, and $X$ and $Y$ are uncorrelated (see Problem 14.7). The difference here is that the joint PDF is not a bivariate Gaussian PDF.


Figure 14.1: Uncorrelated but not independent random variables with Gaussian marginal PDFs. (a) Joint PDF. (b) Constant PDF contours.
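The counterexample can also be explored by simulation. The following is a minimal sketch, not taken from the text: it assumes that a realization of (14.4) can be drawn by picking, with probability 1/2 each, a standard bivariate Gaussian pair with correlation coefficient $+\rho$ or $-\rho$. The variable names and the number of realizations `M` are illustrative choices. The sample estimate of $E[XY]$ should be near zero, while the sample covariance of $X^2$ and $Y^2$ should be clearly nonzero, which is one way to see the dependence.

```matlab
% Sketch: simulate the mixture PDF of (14.4) and compare second-order statistics.
rho = 0.9;                    % value used for Figure 14.1
M = 100000;                   % number of realizations (assumed)
u = randn(M,1);               % independent N(0,1) samples
v = randn(M,1);
s = sign(rand(M,1) - 0.5);    % +1 or -1, each with probability 1/2
x = u;                        % X ~ N(0,1)
y = s.*rho.*u + sqrt(1-rho^2)*v;   % Y ~ N(0,1); correlation is +rho or -rho per component
r_xy = mean(x.*y);            % estimate of E[XY] (both means are zero)
c_sq = mean(x.^2.*y.^2) - mean(x.^2)*mean(y.^2);   % estimate of cov(X^2,Y^2)
```

If `r_xy` is close to zero while `c_sq` is far from zero, the samples behave as uncorrelated but dependent random variables, in line with the warning above.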


A joint cumulative distribution function (CDF) can be defined in the $N$-dimensional case as

$$F_{X_1,X_2,\ldots,X_N}(x_1,x_2,\ldots,x_N) = P[X_1 \le x_1, X_2 \le x_2, \ldots, X_N \le x_N].$$

It has the usual properties of being between 0 and 1, being monotonically increasing as any of the variables increases, and being "right continuous". Also,

$$F_{X_1,X_2,\ldots,X_N}(-\infty,-\infty,\ldots,-\infty) = 0$$

$$F_{X_1,X_2,\ldots,X_N}(+\infty,+\infty,\ldots,+\infty) = 1.$$

The marginal CDFs are easily found by letting the undesired variables be evaluated at $+\infty$. For example, to determine the marginal CDF for $X_1$, we have

$$F_{X_1}(x_1) = F_{X_1,X_2,\ldots,X_N}(x_1,+\infty,+\infty,\ldots,+\infty).$$

14.4  Transformations 

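We consider the transformation from $\mathbf{X}$ to $\mathbf{Y}$ where

$$\begin{aligned}
Y_1 &= g_1(X_1,X_2,\ldots,X_N) \\
Y_2 &= g_2(X_1,X_2,\ldots,X_N) \\
&\;\;\vdots \\
Y_N &= g_N(X_1,X_2,\ldots,X_N)
\end{aligned}$$

and the transformation is one-to-one. Hence $\mathbf{Y}$ is a continuous random vector having a joint PDF (due to the one-to-one property). If we wish to find the PDF of a subset of the $Y_i$'s, then we need only first find the PDF of $\mathbf{Y}$ and then integrate out the undesired variables. The extension of (12.22) for obtaining the joint PDF of two transformed random variables is

$$p_{Y_1,Y_2,\ldots,Y_N}(y_1,y_2,\ldots,y_N) = p_{X_1,X_2,\ldots,X_N}\left(g_1^{-1}(\mathbf{y}), g_2^{-1}(\mathbf{y}), \ldots, g_N^{-1}(\mathbf{y})\right)\left|\det\left(\frac{\partial(x_1,x_2,\ldots,x_N)}{\partial(y_1,y_2,\ldots,y_N)}\right)\right| \qquad (14.5)$$

where

$$\frac{\partial(x_1,x_2,\ldots,x_N)}{\partial(y_1,y_2,\ldots,y_N)} = \begin{bmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_N} \\
\frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_N}{\partial y_1} & \frac{\partial x_N}{\partial y_2} & \cdots & \frac{\partial x_N}{\partial y_N}
\end{bmatrix}$$

is the inverse Jacobian matrix. An example follows.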

Example  14.3  —  Linear  transformation  of  multivariate  Gaussian  random 
vector 

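If $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$ and $\mathbf{Y} = \mathbf{G}\mathbf{X}$, where $\mathbf{G}$ is an invertible $N \times N$ matrix, then we have from $\mathbf{y} = \mathbf{G}\mathbf{x}$ that

$$\mathbf{x} = \mathbf{G}^{-1}\mathbf{y}, \qquad \frac{\partial\mathbf{x}}{\partial\mathbf{y}} = \mathbf{G}^{-1}.$$

Hence, using (14.5) and (14.2)

$$\begin{aligned}
p_{\mathbf{Y}}(\mathbf{y}) &= p_{\mathbf{X}}(\mathbf{G}^{-1}\mathbf{y})\,|\det(\mathbf{G}^{-1})| \\
&= \frac{1}{(2\pi)^{N/2}\det^{1/2}(\mathbf{C})} \exp\left[-\frac{1}{2}(\mathbf{G}^{-1}\mathbf{y}-\boldsymbol{\mu})^T\mathbf{C}^{-1}(\mathbf{G}^{-1}\mathbf{y}-\boldsymbol{\mu})\right]\frac{1}{|\det(\mathbf{G})|} \\
&= \frac{1}{(2\pi)^{N/2}\det^{1/2}(\mathbf{G}\mathbf{C}\mathbf{G}^T)} \exp\left[-\frac{1}{2}(\mathbf{y}-\mathbf{G}\boldsymbol{\mu})^T(\mathbf{G}\mathbf{C}\mathbf{G}^T)^{-1}(\mathbf{y}-\mathbf{G}\boldsymbol{\mu})\right]
\end{aligned}$$

(see Section 12.7 for details of matrix manipulations) so that $\mathbf{Y} \sim \mathcal{N}(\mathbf{G}\boldsymbol{\mu}, \mathbf{G}\mathbf{C}\mathbf{G}^T)$. This result is the extension of Theorem 12.7 from 2 to $N$ jointly Gaussian random variables. See also Problems 14.8 and 14.15 for the case where $\mathbf{G}$ is $M \times N$ with $M < N$. It is shown there that the same result holds.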
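The result $\mathbf{Y} \sim \mathcal{N}(\mathbf{G}\boldsymbol{\mu}, \mathbf{G}\mathbf{C}\mathbf{G}^T)$ can be checked approximately by simulation. The sketch below is illustrative only: the mean vector `mu`, the matrix `G`, and the number of realizations `M` are assumed values, and the Gaussian vectors are generated with a Cholesky factor of `C`. It compares the sample mean and sample covariance of the transformed vectors with $\mathbf{G}\boldsymbol{\mu}$ and $\mathbf{G}\mathbf{C}\mathbf{G}^T$; it does not verify that the PDF itself is Gaussian.

```matlab
% Sketch: compare sample moments of Y = G*X with G*mu and G*C*G'.
mu = [1; 2; 3];                          % assumed mean vector
C = [1 2/3 1/3; 2/3 1 2/3; 1/3 2/3 1];   % covariance matrix of X
G = [1 0 0; 1 1 0; 1 1 1];               % assumed invertible 3 x 3 matrix
M = 100000;                              % number of realizations (assumed)
L = chol(C)';                            % lower triangular factor, C = L*L'
X = L*randn(3,M) + repmat(mu,1,M);       % columns are realizations of X
Y = G*X;                                 % transformed realizations
mean_err = max(abs(mean(Y,2) - G*mu));           % largest error in the mean
cov_err = max(max(abs(cov(Y') - G*C*G')));       % largest error in the covariance
```

Both `mean_err` and `cov_err` should shrink as `M` is increased.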

◊

14.5 Expected Values


The expected value of a random vector is defined as the vector of the expected values of the elements of the random vector. This says that we define

$$E_{\mathbf{X}}[\mathbf{X}] = E_{X_1,X_2,\ldots,X_N}\left[\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_N \end{bmatrix}\right] = \begin{bmatrix} E_{X_1}[X_1] \\ E_{X_2}[X_2] \\ \vdots \\ E_{X_N}[X_N] \end{bmatrix}. \qquad (14.6)$$

We can view this definition as "passing" the expectation "through" the left bracket of the vector since $E_{X_1,X_2,\ldots,X_N}[X_i] = E_{X_i}[X_i]$. A particular expectation of interest is that of a scalar function of $X_1, \ldots, X_N$, say $g(X_1,X_2,\ldots,X_N)$. Similar to previous results (see (12.28)) this is determined using

$$E_{X_1,X_2,\ldots,X_N}[g(X_1,X_2,\ldots,X_N)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(x_1,x_2,\ldots,x_N)\,p_{X_1,X_2,\ldots,X_N}(x_1,x_2,\ldots,x_N)\,dx_1\,dx_2\cdots dx_N. \qquad (14.7)$$

Some specific results of interest are the linearity of the expectation operator or

$$E_{X_1,X_2,\ldots,X_N}\left[\sum_{i=1}^{N} a_i X_i\right] = \sum_{i=1}^{N} a_i E_{X_i}[X_i] \qquad (14.8)$$

and in particular if $a_i = 1$ for all $i$, then we have

$$E_{X_1,X_2,\ldots,X_N}\left[\sum_{i=1}^{N} X_i\right] = \sum_{i=1}^{N} E_{X_i}[X_i]. \qquad (14.9)$$

The variance of a linear combination of random variables is given by

$$\mathrm{var}\left(\sum_{i=1}^{N} a_i X_i\right) = \sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j\,\mathrm{cov}(X_i,X_j) = \mathbf{a}^T\mathbf{C}_X\mathbf{a} \qquad (14.10)$$

where $\mathbf{C}_X$ is the covariance matrix of $\mathbf{X}$ and $\mathbf{a} = [a_1\; a_2\; \ldots\; a_N]^T$. The derivation of (14.10) is identical to that given in the proof of Property 9.2 for discrete random variables. If the random variables are uncorrelated so that the covariance matrix is diagonal or

$$\mathbf{C}_X = \mathrm{diag}(\mathrm{var}(X_1), \mathrm{var}(X_2), \ldots, \mathrm{var}(X_N))$$

then (see Problem 14.10)

$$\mathrm{var}\left(\sum_{i=1}^{N} a_i X_i\right) = \sum_{i=1}^{N} a_i^2\,\mathrm{var}(X_i). \qquad (14.11)$$

If furthermore, $a_i = 1$ for all $i$, then

$$\mathrm{var}\left(\sum_{i=1}^{N} X_i\right) = \sum_{i=1}^{N} \mathrm{var}(X_i). \qquad (14.12)$$
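As a quick numerical sanity check of (14.10) and (14.11), the sketch below evaluates the double sum and the quadratic form for one covariance matrix and one coefficient vector. Both `C` and `a` are arbitrary illustrative values, not taken from the text.

```matlab
% Sketch: the double sum in (14.10) equals the quadratic form a'*C*a.
C = [1 2/3 1/3; 2/3 1 2/3; 1/3 2/3 1];   % assumed covariance matrix
a = [1; -2; 0.5];                        % assumed coefficients
N = length(a);
v_sum = 0;
for i = 1:N
    for j = 1:N
        v_sum = v_sum + a(i)*a(j)*C(i,j);   % a_i a_j cov(X_i,X_j)
    end
end
v_quad = a'*C*a;                         % matrix form of (14.10)
Cd = diag(diag(C));                      % same variances, but uncorrelated
v_uncorr = a'*Cd*a;                      % (14.10) for the diagonal case
v_diag = sum(a.^2 .* diag(C));           % right-hand side of (14.11)
```

Here `v_sum` and `v_quad` agree for any symmetric `C`, while `v_uncorr` and `v_diag` agree because `Cd` is diagonal.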


An  example  follows. 

Example  14.4  -  Sample  mean  of  independent  and  identically  distributed 
random  variables 

Assume that $X_1, X_2, \ldots, X_N$ are independent random variables and each random variable has the same marginal PDF. When random variables have the same marginal PDF, they are said to be identically distributed. Hence, we are assuming that the random variables are independent and identically distributed (IID). As a consequence of being identically distributed, $E_{X_i}[X_i] = \mu$ and $\mathrm{var}(X_i) = \sigma^2$ for all $i$. It is of interest to examine the mean and variance of the random variable that we obtain by averaging the $X_i$'s together. This averaged random variable is

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

and is called the sample mean random variable. We have previously encountered the sample mean when referring to an average of a set of outcomes of a repeated experiment, which produced a number. Now, however, $\bar{X}$ is a function of the random variables $X_1, X_2, \ldots, X_N$ and so is a random variable itself. As such we may consider its probabilistic properties such as its mean and variance. The mean is from (14.8) with $a_i = 1/N$

$$E_{X_1,X_2,\ldots,X_N}[\bar{X}] = \frac{1}{N}\sum_{i=1}^{N} E_{X_i}[X_i] = \mu$$

and the variance is from (14.11) with $a_i = 1/N$ (since the $X_i$'s are independent and hence uncorrelated)

$$\mathrm{var}(\bar{X}) = \sum_{i=1}^{N}\frac{1}{N^2}\,\mathrm{var}(X_i) = \frac{\sigma^2}{N}.$$

Note that on the average the sample mean random variable will yield the value $\mu$, which is the expected value of each $X_i$. Also as $N \to \infty$, $\mathrm{var}(\bar{X}) \to 0$, so that the PDF of $\bar{X}$ will become more and more concentrated about $\mu$. In effect, as $N \to \infty$, we have that $\bar{X} \to \mu$. This says that the sample mean random variable will converge to the true expected value of $X_i$. An example is shown in Figure 14.2 in which the marginal PDF of each $X_i$ is $\mathcal{N}(2, 1)$. In the next chapter we will prove that $\bar{X}$ does indeed converge to $E_{X_i}[X_i] = \mu$.


(a)  N  =  10  (b)  N  =  100 

Figure  14.2:  Estimated  PDF  for  sample  mean  random  variable,  X. 
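The behavior summarized in Figure 14.2 can be reproduced in outline with a short simulation. This is a sketch under stated assumptions: the number of repetitions `M` is an illustrative choice, and the code only estimates the mean and variance of the sample mean for $N = 10$ and $N = 100$ rather than the full PDF.

```matlab
% Sketch: mean and variance of the sample mean for IID N(2,1) data.
mu = 2; sigma2 = 1;           % marginal PDF N(2,1), as in Figure 14.2
M = 5000;                     % number of sample-mean outcomes (assumed)
Nvals = [10 100];             % numbers of samples averaged
est_mean = zeros(size(Nvals));
est_var = zeros(size(Nvals));
for k = 1:length(Nvals)
    N = Nvals(k);
    X = mu + sqrt(sigma2)*randn(N,M);   % each column holds N IID samples
    xbar = mean(X,1);                   % M outcomes of the sample mean
    est_mean(k) = mean(xbar);           % should be near mu
    est_var(k) = var(xbar);             % should be near sigma2/N
end
theory_var = sigma2./Nvals;   % variance predicted by Example 14.4
```

Comparing `est_var` with `theory_var` shows the variance decreasing as $1/N$, which is why the estimated PDF narrows about $\mu$ as $N$ grows.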


14.6  Joint  Moments  and  the  Characteristic  Function 

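The joint moments corresponding to an $N$-dimensional PDF are defined as

$$E_{X_1,X_2,\ldots,X_N}[X_1^{l_1}X_2^{l_2}\cdots X_N^{l_N}] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} x_1^{l_1}x_2^{l_2}\cdots x_N^{l_N}\,p_{X_1,X_2,\ldots,X_N}(x_1,x_2,\ldots,x_N)\,dx_1\,dx_2\cdots dx_N. \qquad (14.13)$$

As usual, if the random variables are independent, the joint PDF factors and therefore

$$E_{X_1,X_2,\ldots,X_N}[X_1^{l_1}X_2^{l_2}\cdots X_N^{l_N}] = E_{X_1}[X_1^{l_1}]\,E_{X_2}[X_2^{l_2}]\cdots E_{X_N}[X_N^{l_N}]. \qquad (14.14)$$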

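The joint characteristic function is defined as

$$\phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N) = E_{X_1,X_2,\ldots,X_N}\left[\exp[j(\omega_1X_1 + \omega_2X_2 + \cdots + \omega_NX_N)]\right] \qquad (14.15)$$

and is evaluated as

$$\phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \exp[j(\omega_1x_1 + \omega_2x_2 + \cdots + \omega_Nx_N)]\,p_{X_1,X_2,\ldots,X_N}(x_1,x_2,\ldots,x_N)\,dx_1\,dx_2\cdots dx_N.$$

In particular, for independent random variables, we have (see Problem 14.13)

$$\phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N) = \phi_{X_1}(\omega_1)\,\phi_{X_2}(\omega_2)\cdots\phi_{X_N}(\omega_N).$$

Also, the joint PDF can be found from the joint characteristic function using the inverse Fourier transform as

$$p_{X_1,X_2,\ldots,X_N}(x_1,x_2,\ldots,x_N) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N)\exp[-j(\omega_1x_1 + \omega_2x_2 + \cdots + \omega_Nx_N)]\,\frac{d\omega_1}{2\pi}\frac{d\omega_2}{2\pi}\cdots\frac{d\omega_N}{2\pi}. \qquad (14.16)$$

All the properties of the 2-dimensional characteristic function extend to the general case. Note that once $\phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N)$ is known, the characteristic function for any subset of the $X_i$'s is found by setting $\omega_i$ equal to zero for the ones not in the subset. For example, to find $p_{X_1,X_2}(x_1,x_2)$, we can let $\omega_3 = \omega_4 = \cdots = \omega_N = 0$ in the joint characteristic function to yield (see Problem 14.14)

$$p_{X_1,X_2}(x_1,x_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \underbrace{\phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,0,\ldots,0)}_{\phi_{X_1,X_2}(\omega_1,\omega_2)} \exp[-j(\omega_1x_1 + \omega_2x_2)]\,\frac{d\omega_1}{2\pi}\frac{d\omega_2}{2\pi}.$$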


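As seen previously, the joint moments can be obtained from the characteristic function. The general formula is

$$E_{X_1,X_2,\ldots,X_N}[X_1^{l_1}X_2^{l_2}\cdots X_N^{l_N}] = \left.\frac{1}{j^{\,l_1+l_2+\cdots+l_N}}\,\frac{\partial^{\,l_1+l_2+\cdots+l_N}}{\partial\omega_1^{l_1}\,\partial\omega_2^{l_2}\cdots\partial\omega_N^{l_N}}\,\phi_{X_1,X_2,\ldots,X_N}(\omega_1,\omega_2,\ldots,\omega_N)\right|_{\omega_1=\omega_2=\cdots=\omega_N=0}. \qquad (14.17)$$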


An  example  follows. 

Example  14.5  —  Second-order  joint  moments  for  multivariate  Gaussian 
PDF 

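In this example we derive the second-order moments $E_{X_i,X_j}[X_iX_j]$ if $\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$. The characteristic function can be shown to be [Muirhead 1982]

$$\phi_{\mathbf{X}}(\boldsymbol{\omega}) = \exp\left(-\frac{1}{2}\boldsymbol{\omega}^T\mathbf{C}\boldsymbol{\omega}\right)$$

where $\boldsymbol{\omega} = [\omega_1\; \omega_2\; \ldots\; \omega_N]^T$. We first let

$$Q(\boldsymbol{\omega}) = \boldsymbol{\omega}^T\mathbf{C}\boldsymbol{\omega} = \sum_{m=1}^{N}\sum_{n=1}^{N} \omega_m\omega_n[\mathbf{C}]_{mn} \qquad (14.18)$$

and note that it is a quadratic form (see Appendix C). Also, we let $[\mathbf{C}]_{mn} = c_{mn}$ to simplify the notation. Then from (14.17) with $l_i = l_j = 1$ and the other $l$'s equal to zero, we have

$$E_{X_i,X_j}[X_iX_j] = \left.\frac{1}{j^2}\,\frac{\partial^2}{\partial\omega_i\,\partial\omega_j}\exp\left(-\frac{1}{2}Q(\boldsymbol{\omega})\right)\right|_{\boldsymbol{\omega}=\mathbf{0}}.$$

Carrying out the partial differentiation produces

$$\begin{aligned}
\frac{\partial\exp[-(1/2)Q(\boldsymbol{\omega})]}{\partial\omega_i} &= -\frac{1}{2}\frac{\partial Q(\boldsymbol{\omega})}{\partial\omega_i}\exp\left(-\frac{1}{2}Q(\boldsymbol{\omega})\right) \\
\frac{\partial^2\exp[-(1/2)Q(\boldsymbol{\omega})]}{\partial\omega_i\,\partial\omega_j} &= \frac{1}{4}\frac{\partial Q(\boldsymbol{\omega})}{\partial\omega_i}\frac{\partial Q(\boldsymbol{\omega})}{\partial\omega_j}\exp\left(-\frac{1}{2}Q(\boldsymbol{\omega})\right) - \frac{1}{2}\frac{\partial^2 Q(\boldsymbol{\omega})}{\partial\omega_i\,\partial\omega_j}\exp\left(-\frac{1}{2}Q(\boldsymbol{\omega})\right).
\end{aligned} \qquad (14.19)$$

But

$$\begin{aligned}
\left.\frac{\partial Q(\boldsymbol{\omega})}{\partial\omega_i}\right|_{\boldsymbol{\omega}=\mathbf{0}} &= \left.\sum_{m=1}^{N}\sum_{n=1}^{N}\frac{\partial(\omega_m\omega_n)}{\partial\omega_i}\,c_{mn}\right|_{\boldsymbol{\omega}=\mathbf{0}} \quad \text{(from (14.18))} \\
&= \left.\sum_{m=1}^{N}\sum_{n=1}^{N}\left[\omega_m\frac{\partial\omega_n}{\partial\omega_i}c_{mn} + \omega_n\frac{\partial\omega_m}{\partial\omega_i}c_{mn}\right]\right|_{\boldsymbol{\omega}=\mathbf{0}} = 0
\end{aligned} \qquad (14.20)$$

and also

$$\left.\frac{\partial^2 Q(\boldsymbol{\omega})}{\partial\omega_i\,\partial\omega_j}\right|_{\boldsymbol{\omega}=\mathbf{0}} = \left.\sum_{m=1}^{N}\sum_{n=1}^{N}\frac{\partial^2(\omega_m\omega_n)}{\partial\omega_i\,\partial\omega_j}\,c_{mn}\right|_{\boldsymbol{\omega}=\mathbf{0}}. \qquad (14.21)$$

But

$$\frac{\partial(\omega_m\omega_n)}{\partial\omega_i} = \omega_m\frac{\partial\omega_n}{\partial\omega_i} + \omega_n\frac{\partial\omega_m}{\partial\omega_i} = \omega_m\delta_{ni} + \omega_n\delta_{mi}$$

where $\delta_{ij}$ is the Kronecker delta, which is defined to be 1 if $i = j$ and 0 otherwise. Hence

$$\frac{\partial^2(\omega_m\omega_n)}{\partial\omega_i\,\partial\omega_j} = \delta_{mj}\delta_{ni} + \delta_{nj}\delta_{mi}$$

and $\delta_{mj}\delta_{ni}$ equals 1 if $(m,n) = (j,i)$ and equals 0 otherwise, and $\delta_{nj}\delta_{mi}$ equals 1 if $(m,n) = (i,j)$ and equals 0 otherwise. Thus,

$$\begin{aligned}
\left.\frac{\partial^2 Q(\boldsymbol{\omega})}{\partial\omega_i\,\partial\omega_j}\right|_{\boldsymbol{\omega}=\mathbf{0}} &= c_{ji} + c_{ij} \quad \text{(from (14.21))} \\
&= 2c_{ij} \quad \text{(recall that } \mathbf{C}^T = \mathbf{C}\text{)}.
\end{aligned}$$

Finally, we have the expected result from (14.19) and (14.20) that

$$E_{X_i,X_j}[X_iX_j] = \left.\frac{1}{2}\frac{\partial^2 Q(\boldsymbol{\omega})}{\partial\omega_i\,\partial\omega_j}\right|_{\boldsymbol{\omega}=\mathbf{0}} = c_{ij} = [\mathbf{C}]_{ij}.$$

◊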
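The moment formula (14.17) for this example can be checked numerically by approximating the mixed second partial derivative of the characteristic function with central differences. The sketch below is only a numerical illustration: the covariance matrix `C` and the step size `h` are assumed values, and the finite-difference approximation introduces a small error of order `h^2`.

```matlab
% Sketch: recover E[X_i X_j] = c_ij from phi(w) = exp(-w'*C*w/2) via (14.17).
C = [1 2/3 1/3; 2/3 1 2/3; 1/3 2/3 1];   % assumed covariance matrix
phi = @(w) exp(-0.5*(w'*C*w));           % characteristic function for N(0,C)
h = 1e-3;                                % finite-difference step (assumed)
N = size(C,1);
I = eye(N);
R = zeros(N);                            % estimated second-order moments
for i = 1:N
    for j = 1:N
        ei = h*I(:,i);
        ej = h*I(:,j);
        % central-difference estimate of d^2 phi / (dw_i dw_j) at w = 0
        d2 = (phi(ei+ej) - phi(ei-ej) - phi(-ei+ej) + phi(-ei-ej))/(4*h^2);
        R(i,j) = -d2;                    % multiply by 1/j^2 = -1
    end
end
max_err = max(max(abs(R - C)));          % should be small
```

The matrix `R` should agree with `C` to within the finite-difference error, consistent with the result of Example 14.5.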

Lastly, we extend the characteristic function approach to determining the PDF for a sum of IID random variables. Letting $Y = \sum_{i=1}^{N} X_i$, the characteristic function of $Y$ is defined by

$$\phi_Y(\omega) = E_Y[\exp(j\omega Y)]$$

and is evaluated using (14.7) with $g(X_1,X_2,\ldots,X_N) = \exp[j\omega\sum_{i=1}^{N} X_i]$ (the real and imaginary parts are evaluated as separate integrals) as

$$\begin{aligned}
\phi_Y(\omega) &= E_{X_1,X_2,\ldots,X_N}\left[\exp\left(j\omega\sum_{i=1}^{N} X_i\right)\right] \\
&= E_{X_1,X_2,\ldots,X_N}\left[\prod_{i=1}^{N}\exp(j\omega X_i)\right].
\end{aligned}$$

Now using the fact that the $X_i$'s are IID, we have that

$$\begin{aligned}
\phi_Y(\omega) &= \prod_{i=1}^{N} E_{X_i}[\exp(j\omega X_i)] \quad \text{(independence)} \\
&= \prod_{i=1}^{N}\phi_{X_i}(\omega) \\
&= [\phi_X(\omega)]^N \quad \text{(identically distributed)}
\end{aligned}$$

where $\phi_X(\omega)$ is the common characteristic function of the random variables. To finally obtain the PDF of the sum random variable we use an inverse Fourier transform to yield

$$p_Y(y) = \int_{-\infty}^{\infty} [\phi_X(\omega)]^N \exp(-j\omega y)\,\frac{d\omega}{2\pi}. \qquad (14.22)$$

This formula will form the basis for the exploration of the PDF of a sum of IID random variables in Chapter 15. See Problems 14.17 and 14.18 for some examples of its use.

14.7 Conditional PDFs

The discussion in Section 9.7 of the definitions and properties of the conditional PMF also holds for the conditional PDF. To accommodate continuous random variables we need only replace the PMF notation of the "bracket" with that of the PDF notation of the "parenthesis." Hence, we do not pursue this topic further.

14.8  Prediction  of  a  Random  Variable  Outcome 

We have seen in Section 7.9 that the optimal linear prediction of the outcome of $Y$ when $X = x$ is observed to occur is

$$\hat{Y} = E_Y[Y] + \frac{\mathrm{cov}(X,Y)}{\mathrm{var}(X)}(x - E_X[X]). \qquad (14.23)$$

If $(X, Y)$ has a bivariate Gaussian PDF, then the linear predictor is also the optimal predictor, amongst all linear and nonlinear predictors. We now extend these results to the prediction of a random variable after having observed the outcomes of several other random variables. In doing so the orthogonality principle will be introduced. Our discussions will assume only zero mean random variables, although the results are easily modified to yield the prediction for a nonzero mean random variable. To do so note that (14.23) can also be written as

$$\hat{Y} - E_Y[Y] = \frac{\mathrm{cov}(X,Y)}{\mathrm{var}(X)}(x - E_X[X]).$$

But if $X$ and $Y$ had been zero mean, then we would have obtained

$$\hat{Y} = \frac{\mathrm{cov}(X,Y)}{\mathrm{var}(X)}\,x.$$

It is clear that the modification from the zero mean case to the nonzero mean case is to replace each $x_i$ by $x_i - E_{X_i}[X_i]$ and also $\hat{Y}$ by $\hat{Y} - E_Y[Y]$.

Now consider the $p+1$ continuous random variables $\{X_1, X_2, \ldots, X_p, X_{p+1}\}$ and say we wish to predict $X_{p+1}$ based on the knowledge of the outcomes of $X_1, X_2, \ldots, X_p$. Letting $X_1 = x_1, X_2 = x_2, \ldots, X_p = x_p$ be those outcomes, we consider the linear prediction

$$\hat{X}_{p+1} = \sum_{i=1}^{p} a_i x_i \qquad (14.24)$$

where the $a_i$'s are the linear prediction coefficients, which are to be determined. The optimal coefficients are chosen to minimize the mean square error (MSE)

$$\mathrm{mse} = E_{X_1,X_2,\ldots,X_{p+1}}[(X_{p+1} - \hat{X}_{p+1})^2]$$

or written more explicitly as

$$\mathrm{mse} = E_{X_1,X_2,\ldots,X_{p+1}}\left[\left(X_{p+1} - \sum_{i=1}^{p} a_i X_i\right)^2\right]. \qquad (14.25)$$

We have used $\sum_{i=1}^{p} a_i X_i$, which is a random variable, as the predictor in order that the error measure be the average over all predictions. If we now differentiate the MSE with respect to $a_1$ we obtain

$$\begin{aligned}
\frac{\partial E_{X_1,X_2,\ldots,X_{p+1}}\left[\left(X_{p+1} - \sum_{i=1}^{p} a_i X_i\right)^2\right]}{\partial a_1}
&= E_{X_1,X_2,\ldots,X_{p+1}}\left[\frac{\partial}{\partial a_1}\left(X_{p+1} - \sum_{i=1}^{p} a_i X_i\right)^2\right] \quad \text{(interchange integration and differentiation)} \\
&= E_{X_1,X_2,\ldots,X_{p+1}}\left[-2\left(X_{p+1} - \sum_{i=1}^{p} a_i X_i\right)X_1\right] \\
&= 0.
\end{aligned} \qquad (14.26)$$

This produces

$$E_{X_1,X_{p+1}}[X_1X_{p+1}] = E_{X_1,X_2,\ldots,X_p}\left[X_1\sum_{i=1}^{p} a_i X_i\right]$$

or

$$E_{X_1,X_{p+1}}[X_1X_{p+1}] = \sum_{i=1}^{p} a_i E_{X_1,X_i}[X_1X_i].$$

Letting $c_{ij} = E_{X_i,X_j}[X_iX_j]$ denote the covariance (since the $X_i$'s are zero mean) we have the equation

$$\sum_{i=1}^{p} c_{1i}a_i = c_{1,p+1}.$$

If we differentiate with respect to the other coefficients, similar equations are obtained. In all, there will be $p$ simultaneous linear equations given by

$$\sum_{i=1}^{p} c_{ki}a_i = c_{k,p+1} \qquad k = 1, 2, \ldots, p$$

that need to be solved to yield the $a_i$'s. These equations can be written in vector/matrix form as

$$\underbrace{\begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1p} \\
c_{21} & c_{22} & \cdots & c_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
c_{p1} & c_{p2} & \cdots & c_{pp}
\end{bmatrix}}_{\mathbf{C}}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
= \underbrace{\begin{bmatrix} c_{1,p+1} \\ c_{2,p+1} \\ \vdots \\ c_{p,p+1} \end{bmatrix}}_{\mathbf{c}}. \qquad (14.27)$$

We note that $\mathbf{C}$ is the covariance matrix of the random vector $[X_1\; X_2\; \ldots\; X_p]^T$ and $\mathbf{c}$ is the vector of covariances between $X_{p+1}$ and each $X_i$ used in the predictor. The linear prediction coefficients are found by solving these linear equations. An example follows.

Example 14.6 - Linear prediction based on two random variable outcomes

Consider the prediction of $X_3$ based on the outcomes of $X_1$ and $X_2$ so that $\hat{X}_3 = a_1x_1 + a_2x_2$, where $p = 2$. If we know the covariance matrix of $\mathbf{X} = [X_1\; X_2\; X_3]^T$, say $\mathbf{C}_X$, then all the $c_{ij}$'s needed for (14.27) are known. Hence, suppose that

$$\mathbf{C}_X = \begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix} = \begin{bmatrix} 1 & 2/3 & 1/3 \\ 2/3 & 1 & 2/3 \\ 1/3 & 2/3 & 1 \end{bmatrix}.$$

Thus, $X_3$ is correlated with $X_2$ with a correlation coefficient of 2/3 and $X_3$ is correlated with $X_1$ but with a smaller correlation coefficient of 1/3. Using (14.27) with $p = 2$ we must solve

$$\begin{bmatrix} 1 & 2/3 \\ 2/3 & 1 \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 1/3 \\ 2/3 \end{bmatrix}.$$

By inverting the covariance matrix we have the solution

$$\begin{bmatrix} a_{1_{\mathrm{opt}}} \\ a_{2_{\mathrm{opt}}} \end{bmatrix} = \frac{1}{1 - (2/3)^2}\begin{bmatrix} 1 & -2/3 \\ -2/3 & 1 \end{bmatrix}\begin{bmatrix} 1/3 \\ 2/3 \end{bmatrix} = \begin{bmatrix} -\frac{1}{5} \\ \frac{4}{5} \end{bmatrix}.$$

Due to the larger correlation of $X_3$ with $X_2$, the prediction coefficient $a_2$ is larger. Note that if the covariance matrix is $\mathbf{C}_X = \sigma^2\mathbf{I}$, then $c_{13} = c_{23} = 0$ and $a_{1_{\mathrm{opt}}} = a_{2_{\mathrm{opt}}} = 0$. This results in $\hat{X}_3 = 0$ or more generally for random variables with nonzero means, $\hat{X}_3 = E_{X_3}[X_3]$, as one might expect. See also Problem 14.24 to see how to determine the minimum value of the MSE.

◊
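The solution of (14.27) for this example can be reproduced in a few lines. The sketch below uses the covariance matrix of Example 14.6; the variable names are illustrative, and the backslash operator is used in place of an explicit matrix inverse, which is a design choice of this sketch rather than something prescribed by the text.

```matlab
% Sketch: solve the linear prediction equations (14.27) for Example 14.6.
Cx = [1 2/3 1/3; 2/3 1 2/3; 1/3 2/3 1];  % covariance matrix of [X1 X2 X3]'
p = 2;                                   % number of random variables observed
C = Cx(1:p,1:p);                         % covariance matrix of [X1 X2]'
c = Cx(1:p,p+1);                         % covariances between X3 and X1, X2
a_opt = C\c;                             % optimal coefficients, [-1/5; 4/5]
x = [0.5; -1];                           % hypothetical observed outcomes x1, x2
x3_hat = a_opt'*x;                       % prediction a1*x1 + a2*x2
```

For other values of `p` or other covariance matrices only `Cx` and `p` need to change, provided the $p \times p$ matrix `C` is nonsingular.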

As another simple example, observe what happens if $p = 1$ so that we wish to predict $X_2$ based on the outcome of $X_1$. In this case we have that $\hat{X}_2 = a_1x_1$ and from (14.27), the solution for $a_1$ is $a_{1_{\mathrm{opt}}} = c_{12}/c_{11} = \mathrm{cov}(X_1,X_2)/\mathrm{var}(X_1)$. Hence, $\hat{X}_2 = [\mathrm{cov}(X_1,X_2)/\mathrm{var}(X_1)]x_1$ and we recover our previous results for the bivariate case (see (14.23) and let $E_X[X] = E_Y[Y] = 0$) by replacing $X_1$ with $X$, $x_1$ with $x$, and $X_2$ with $Y$.

An interesting and quite useful interpretation of the linear prediction procedure can be made by reexamining (14.26). To simplify the discussion let $p = 2$ so that the equations to be solved are

$$\begin{aligned}
E_{X_1,X_2,X_3}[(X_3 - a_1X_1 - a_2X_2)X_1] &= 0 \\
E_{X_1,X_2,X_3}[(X_3 - a_1X_1 - a_2X_2)X_2] &= 0.
\end{aligned} \qquad (14.28)$$

Let the predictor error be denoted by $\epsilon$, which is explicitly $\epsilon = X_3 - a_1X_1 - a_2X_2$. Then (14.28) becomes

$$\begin{aligned}
E_{X_1,X_2,X_3}[\epsilon X_1] &= 0 \\
E_{X_1,X_2,X_3}[\epsilon X_2] &= 0
\end{aligned} \qquad (14.29)$$

which says that the optimal prediction coefficients $a_1, a_2$ are found by making the predictor error uncorrelated with the random variables used to predict $X_3$. Presumably if this were not the case, then some correlation would remain between the error and $X_1, X_2$, and this correlation could be exploited to reduce the error further (see Problem 14.23).

A geometric interpretation of (14.29) becomes apparent by considering $X_1$, $X_2$, and $X_3$ as vectors in a Euclidean space as depicted in Figure 14.3a. Since $\hat{X}_3 = a_1X_1 + a_2X_2$, $\hat{X}_3$ can be any vector in the shaded region, which is the $X_1$-$X_2$ plane, depending upon the choice of $a_1$ and $a_2$. To minimize the error we should choose $\hat{X}_3$ as the orthogonal projection onto the plane as shown in Figure 14.3b.

Figure 14.3: Geometrical interpretation of linear prediction.

But this is equivalent to making the error vector $\epsilon$ orthogonal to any vector in the plane. In particular, then we have the requirement that

$$\begin{aligned}
\epsilon &\perp X_1 \\
\epsilon &\perp X_2
\end{aligned} \qquad (14.30)$$

where $\perp$ denotes orthogonality. To relate these conditions back to those of (14.29) we define two zero mean random variables $X$ and $Y$ to be orthogonal if $E_{X,Y}[XY] = 0$. Hence, we have that (14.30) is equivalent to

$$\begin{aligned}
E_{X_1,X_2,X_3}[\epsilon X_1] &= 0 \\
E_{X_1,X_2,X_3}[\epsilon X_2] &= 0
\end{aligned}$$

or just the condition given by (14.29). (Since $\epsilon$ depends on $(X_1,X_2,X_3)$, the expectation reflects this dependence.) This is called the orthogonality principle. It asserts that to minimize the MSE the error "vector" should be orthogonal to each of the "data vectors" used to predict the desired "vector". The "vectors" $X$ and $Y$ are defined to be orthogonal if $E_{X,Y}[XY] = 0$, which is equivalent to being uncorrelated since we have assumed zero mean random variables. See also Problem 14.22 for the one-dimensional case of the orthogonality principle.
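The orthogonality conditions can be checked directly from the covariance matrix, since for zero mean random variables $E[\epsilon X_k] = c_{k3} - a_1c_{k1} - a_2c_{k2}$. The following sketch assumes the covariance matrix of Example 14.6; the helper `mse` is a hypothetical name introduced here for the expanded form of (14.25) with $p = 2$, and the perturbation `[0.1; -0.1]` is an arbitrary choice used only to show that moving away from the optimal coefficients increases the MSE.

```matlab
% Sketch: orthogonality of the prediction error for the p = 2 case.
Cx = [1 2/3 1/3; 2/3 1 2/3; 1/3 2/3 1];  % covariance matrix from Example 14.6
C = Cx(1:2,1:2);                         % covariance matrix of [X1 X2]'
c = Cx(1:2,3);                           % covariances of X3 with X1 and X2
a = C\c;                                 % optimal prediction coefficients
orth = c - C*a;                          % [E[eps*X1]; E[eps*X2]], see (14.29)
% MSE of (14.25) expanded for zero mean random variables (hypothetical helper)
mse = @(b) Cx(3,3) - 2*b'*c + b'*C*b;
mse_opt = mse(a);                        % MSE at the optimal coefficients
mse_pert = mse(a + [0.1; -0.1]);         % MSE at perturbed coefficients
```

The entries of `orth` vanish (to rounding error) only at the optimal coefficients, and `mse_pert` exceeds `mse_opt`.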

14.9  Computer  Simulation  of  Gaussian 
Random  Vectors 

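The method described in Section 12.11 for generating a bivariate Gaussian random vector is easily extended to the $N$-dimensional case. To generate a realization of $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$ we proceed as follows:

1. Perform a Cholesky decomposition of $\mathbf{C}$ to yield the $N \times N$ nonsingular matrix $\mathbf{G}$, where $\mathbf{C} = \mathbf{G}\mathbf{G}^T$.

2. Generate a realization $\mathbf{u}$ of an $N \times 1$ random vector $\mathbf{U}$ whose PDF is $\mathcal{N}(\mathbf{0}, \mathbf{I})$.

3. Form the realization of $\mathbf{X}$ as $\mathbf{x} = \mathbf{G}\mathbf{u} + \boldsymbol{\mu}$.

As an example, if $\boldsymbol{\mu} = \mathbf{0}$ and

$$\mathbf{C} = \begin{bmatrix} 1 & 2/3 & 1/3 \\ 2/3 & 1 & 2/3 \\ 1/3 & 2/3 & 1 \end{bmatrix} \qquad (14.31)$$

then

$$\mathbf{G} = \begin{bmatrix} 1 & 0 & 0 \\ 0.6667 & 0.7454 & 0 \\ 0.3333 & 0.5963 & 0.7303 \end{bmatrix}.$$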

We plot 100 realizations of X in Figure 14.4. The MATLAB code is given next.

Figure 14.4: Realizations of 3 x 1 multivariate Gaussian random vector.

C=[1 2/3 1/3;2/3 1 2/3;1/3 2/3 1];
G=chol(C)'; % perform Cholesky decomposition
            % MATLAB produces C=A'*A so G=A'
M=200;
x=zeros(3,M); % preallocate storage for the realizations
for m=1:M % generate realizations of x
   u=[randn(1,1) randn(1,1) randn(1,1)]';
   x(:,m)=G*u; % realizations stored as columns of 3 x 200 matrix
end
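The three-step procedure also covers a nonzero mean and can be written without a loop. The following is a sketch under stated assumptions: the mean vector `mu` and the number of realizations `M` are illustrative values, and the sample mean and sample covariance are computed only as a rough check that the generated vectors have the intended first and second moments.

```matlab
% Sketch: generate M realizations of X ~ N(mu,C) in one step.
mu = [1; -1; 2];                         % assumed nonzero mean vector
C = [1 2/3 1/3; 2/3 1 2/3; 1/3 2/3 1];   % covariance matrix of (14.31)
M = 50000;                               % number of realizations (assumed)
N = length(mu);
G = chol(C)';                            % step 1: C = G*G', G lower triangular
U = randn(N,M);                          % step 2: columns are realizations of U
X = G*U + repmat(mu,1,M);                % step 3: x = G*u + mu for each column
mu_hat = mean(X,2);                      % sample mean vector
C_hat = cov(X');                         % sample covariance matrix
```

As `M` grows, `mu_hat` and `C_hat` should approach `mu` and `C`.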


14.10  Real-World  Example  —  Signal  Detection 

An important problem in sonar and radar is to be able to determine when an object, such as a submarine in sonar or an aircraft in radar, is present. To make this decision a pulse is transmitted into the water (sonar) or air (radar) and one looks to see if a reflected pulse from the object is returned. Typically, a digital computer is used to sample the received waveform in time and store the samples in memory for further processing. We will denote the received samples as $X_1, X_2, \ldots, X_N$. If there is no reflection, indicating no object is present, the received samples are due to noise only. If, however, there is a reflected pulse, also called an echo, the received samples will consist of a signal added to the noise. A standard model for the received samples is to assume that $X_i = W_i$, where $W_i \sim \mathcal{N}(0, \sigma^2)$ for noise only present and $X_i = s_i + W_i$ for a signal plus noise present. The noise samples $W_i$ are usually also assumed to be independent and hence they are IID. With this modeling we can formulate the signal detection problem as the problem of deciding between the following two hypotheses

$$\begin{aligned}
\mathcal{H}_W &: X_i = W_i \qquad\quad\;\; i = 1, 2, \ldots, N \\
\mathcal{H}_{s+W} &: X_i = s_i + W_i \qquad i = 1, 2, \ldots, N.
\end{aligned}$$

It can be shown that a good decision procedure is to choose the hypothesis for which the received data samples have the highest probability of occurring. In other words, if the received data is more probable when $\mathcal{H}_{s+W}$ is true than when $\mathcal{H}_W$ is true, we say that a signal is present. Otherwise, we decide that noise only is present. To implement this approach we let $p_{\mathbf{X}}(\mathbf{x}; \mathcal{H}_W)$ be the PDF when noise only is present and $p_{\mathbf{X}}(\mathbf{x}; \mathcal{H}_{s+W})$ be the PDF when a signal plus noise is present. Then we decide a signal is present if

$$p_{\mathbf{X}}(\mathbf{x}; \mathcal{H}_{s+W}) > p_{\mathbf{X}}(\mathbf{x}; \mathcal{H}_W). \qquad (14.32)$$

But from the modeling we have that $\mathbf{X} = \mathbf{W} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ for no signal present and $\mathbf{X} = \mathbf{s} + \mathbf{W} \sim \mathcal{N}(\mathbf{s}, \sigma^2\mathbf{I})$ when a signal is present. Here we have defined the signal vector as $\mathbf{s} = [s_1\; s_2\; \ldots\; s_N]^T$. Hence, (14.32) becomes from (14.2)


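$$\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}(\mathbf{x}-\mathbf{s})^T(\mathbf{x}-\mathbf{s})\right] > \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\mathbf{x}^T\mathbf{x}\right].$$

An equivalent inequality is

$$-(\mathbf{x}-\mathbf{s})^T(\mathbf{x}-\mathbf{s}) > -\mathbf{x}^T\mathbf{x}$$

since the constant $1/(2\pi\sigma^2)^{N/2}$ is positive and the exponential function increases with its argument. Expanding the terms we have

$$-\mathbf{x}^T\mathbf{x} + \mathbf{x}^T\mathbf{s} + \mathbf{s}^T\mathbf{x} - \mathbf{s}^T\mathbf{s} > -\mathbf{x}^T\mathbf{x}$$

and since $\mathbf{s}^T\mathbf{x} = \mathbf{x}^T\mathbf{s}$ we have

$$\mathbf{x}^T\mathbf{s} > \frac{1}{2}\mathbf{s}^T\mathbf{s}$$

or finally we decide a signal is present if

$$\sum_{i=1}^{N} x_i s_i > \frac{1}{2}\sum_{i=1}^{N} s_i^2. \qquad (14.33)$$

This detector is called a replica correlator [Kay 1998] since it correlates the data $x_1, x_2, \ldots, x_N$ with a replica of the signal $s_1, s_2, \ldots, s_N$. The quantity on the right-hand side of (14.33) is called the threshold. If the value of $\sum_{i=1}^{N} x_i s_i$ exceeds the threshold, the signal is declared as being present.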
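The decision rule (14.33) is simple to simulate. The sketch below is an illustration under assumed values: a constant signal $s_i = A$ with $A = 0.5$, $\sigma^2 = 1$, $N = 100$ samples, and `M` Monte Carlo trials. The anonymous function `detect` and the names `p_fa` and `p_d` (the fractions of trials in which a signal is declared when noise only and when signal plus noise is present, respectively) are hypothetical names introduced for this sketch.

```matlab
% Sketch: replica correlator of (14.33) applied to simulated data.
N = 100; A = 0.5; sigma2 = 1;            % assumed signal level and noise variance
s = A*ones(N,1);                         % assumed signal replica (constant level)
thresh = 0.5*(s'*s);                     % threshold, right-hand side of (14.33)
detect = @(x) (x'*s) > thresh;           % true if a signal is declared present
w = sqrt(sigma2)*randn(N,1);             % one realization of the noise
d_noise = detect(w);                     % decision for noise only data
d_signal = detect(s + w);                % decision for signal plus noise data
M = 10000;                               % number of Monte Carlo trials (assumed)
W = sqrt(sigma2)*randn(N,M);             % columns are noise realizations
p_fa = mean((W'*s) > thresh);                     % signal declared, noise only
p_d = mean(((W + repmat(s,1,M))'*s) > thresh);    % signal declared, signal present
```

With these values `p_fa` should be small and `p_d` close to one; increasing $\sigma^2$ or decreasing `N` makes the two hypotheses harder to distinguish.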

As an example, assume that the signal is a "DC level" pulse or $s_i = A$ for $i = 1, 2, \ldots, N$ and that $A > 0$. Then (14.33) reduces to

$$A\sum_{i=1}^{N} x_i > \frac{1}{2}NA^2$$

and since $A > 0$, we decide a signal is present if

$$\frac{1}{N}\sum_{i=1}^{N} x_i > \frac{A}{2}.$$

Hence, the sample mean is compared to a threshold of $A/2$. To see how this detector performs we choose $A = 0.5$ and $\sigma^2 = 1$. The received data samples are shown in Figure 14.5a for the case of noise only and in Figure 14.5b for the case of a signal plus noise. A total of 100 received data samples are shown. Note that the noise samples generated are different for each figure.

Figure 14.5: Received data samples. Signal is $s_i = A = 0.5$ and noise consists of IID standard Gaussian random variables. (a) Noise only. (b) Signal plus noise.

The value of the sample mean $(1/N)\sum_{i=1}^{N} x_i$ is shown in Figure 14.6 versus the number of data samples $N$ used in the averaging. For example, if $N = 10$, then the value shown is $(1/10)\sum_{i=1}^{10} x_i$, where $x_i$ is found from the first 10 samples of Figure 14.5. To more easily observe the results they have been plotted as a continuous curve by connecting the points with straight lines. Also, the threshold of $A/2 = 0.25$ is shown as the dashed line. It is seen that as the number of data samples averaged increases, the sample mean converges to the mean of $X_i$ (see also Example 14.4). When noise only is present, this becomes $E_X[X] = 0$ and when a signal is present, it becomes $E_X[X] = A = 0.5$. Thus by comparing the sample mean to the threshold of $A/2 = 0.25$ we should be able to decide if a signal is present or not most of the time (see also Problem 14.26).

Figure 14.6: Value of sample mean versus the number of data samples averaged. (a) Total view. (b) Expanded view for 70 < N < 100.

Problems


14.1 (⌣) (w,f) If $Y = X_1 + X_2 + X_3$, where $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$ and


2 

3  . 

1  1/2  1/4 

1/2  1  1/2 

1/4  1/2  1 


find  the  mean  and  variance  of  Y. 


14.2 (w,c) If $[X_1\; X_2]^T \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, find $P[X_1^2 + X_2^2 > R^2]$. Next, let $\sigma^2 = 1$ and $R = 1$ and lend credence to your result by performing a computer simulation to estimate the probability.

14.3 (f) Find the PDF of $Y = X_1^2 + X_2^2 + X_3^2$ if $\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Hint: Use the results of Example 14.1. Note that you should obtain the PDF for a $\chi_3^2$ random variable.

14.4  (w)  An  airline  has  flights  that  depart  according  to  schedule  95%  of  the  time. 
This  means  that  they  depart  late  1/2  hour  or  more  5%  of  the  time  due  to 
mechanical  problems,  traffic  delays,  etc.  (for  less  than  1/2  hour  the  plane  is 
considered  to  be  “on  time”).  The  amount  of  time  that  the  plane  is  late  is 
modeled  as  an  exp(A)  random  variable.  If  a  person  takes  a  plane  that  makes 
two  stops  at  intermediate  destinations,  what  is  the  probability  that  he  will 
be  more  than  1  1/2  hours  late?  Hint:  You  will  need  the  PDF  for  a  sum  of 
independent  exponential  random  variables. 

14.5  (f)  Consider  the  transformation  from  spherical  to  Cartesian  coordinates.  Show 
that  the  Jacobian  has  a  determinant  whose  absolute  value  is  equal  to  r2  sin  </>. 

14.6  (o)  (w)  A  large  group  of  college  students  have  weights  that  can  be  modeled 
as  a  Af{  150,30)  random  variable.  If  4  students  are  selected  at  random,  what 
is  the  probability  that  they  will  all  weigh  more  than  150  lbs? 

14.7  (t)  Prove  that  the  joint  PDF  given  by  (14.4)  has  A/"(0, 1)  marginal  PDFs  and 
that  the  random  variables  are  uncorrelated.  Hint:  Use  the  known  properties 
of  the  standard  bivariate  Gaussian  PDF. 

14.8  (t)  Assume  that  X  ~  J\f( 0,  C)  for  X  an  N  x  1  random  vector  and  that  Y  = 
GX,  where  G  is  an  M  x  N  matrix  with  M  <  N.  If  the  characteristic  function 
of  X  is  </>x(kj)  —  exp  (— ^wTCu;),  find  the  characteristic  function  of  Y.  Use 
the  following 

=  £Y[exp(ju,TY)]  =  £x[exp(ju;TGX)]  =  £x[exp(j(GTu;)TX)]. 
Based  on  your  results  conclude  that  Y  ~  Af( 0,  GCGT). 

14.9  (0)  (f )  If  Y  =  Xi  +  X2  +  X3,  where  X  -  Af{ 0,  C)  and  C  =  diag(a2,  o\,  af ), 
find  the  PDF  of  Y.  Hint:  See  Problem  14.8. 

14.10  (f)  Show  that  if  Cx  is  a  diagonal  matrix,  then  aTCxa  =  J2iLi  o^vai(Xi). 

14.11  (c)  Simulate  a  single  realization  of  a  random  vector  composed  of  IID  random 
variables  with  PDF  X{  ~  Af(  1, 2)  for  i  =  1, 2, . . . ,  N.  Do  this  by  repeating  an 
experiment  that  successively  generates  X  ~  Af(  1,2).  Then,  find  the  outcome 
of  the  sample  mean  random  variable  and  discuss  what  happens  as  N  becomes 
large. 

14.12 (^) (w,c) An $N \times 1$ random vector $\mathbf{X}$ has $E_{X_i}[X_i] = \mu$ and $\mathrm{var}(X_i) = i\sigma^2$ for
$i = 1, 2, \ldots, N$. The components of $\mathbf{X}$ are independent. Does the sample mean
random variable converge to $\mu$ as $N$ becomes large? Carry out a computer
simulation  for  this  problem  and  explain  your  results. 

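14.13 (t) Prove that if $X_1, X_2, \ldots, X_N$ are independent random variables, then

$$\phi_{X_1, X_2, \ldots, X_N}(\omega_1, \omega_2, \ldots, \omega_N) = \prod_{i=1}^{N} \phi_{X_i}(\omega_i).$$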

14.14 (t) Prove that $\phi_{X_1, X_2, \ldots, X_N}(\omega_1, \omega_2, 0, 0, \ldots, 0) = \phi_{X_1, X_2}(\omega_1, \omega_2)$.

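14.15 (t) If $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$ with $\mathbf{X}$ an $N \times 1$ random vector, prove that the characteristic
function is

$$\phi_{\mathbf{X}}(\boldsymbol{\omega}) = \exp\left(j\boldsymbol{\omega}^T \boldsymbol{\mu} - \frac{1}{2}\boldsymbol{\omega}^T \mathbf{C} \boldsymbol{\omega}\right).$$

To do so note that the characteristic function of a random vector distributed
according to $\mathcal{N}(\mathbf{0}, \mathbf{C})$ is $\exp\left(-\frac{1}{2}\boldsymbol{\omega}^T \mathbf{C} \boldsymbol{\omega}\right)$. With these results show that the
PDF of $\mathbf{Y} = \mathbf{G}\mathbf{X}$ for $\mathbf{G}$ an $M \times N$ matrix with $M < N$ is $\mathcal{N}(\mathbf{G}\boldsymbol{\mu}, \mathbf{G}\mathbf{C}\mathbf{G}^T)$.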

14.16 (t) Prove that if $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$ for $\mathbf{X}$ an $N \times 1$ random vector, then the
marginal PDFs are $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$. Hint: Examine the PDF of $Y = \mathbf{e}_i^T \mathbf{X}$,
where $\mathbf{e}_i$ is the $N \times 1$ vector whose elements are all zeros except for the $i$th
element, which is a one. Also, make use of the results of Problem 14.15.

14.17 (f) Prove that if $X_i \sim \mathcal{N}(0, 1)$ for $i = 1, 2, \ldots, N$ and the $X_i$'s are IID, then
$\sum_{i=1}^{N} X_i^2 \sim \chi_N^2$. To do so first find the characteristic function of $X_i^2$. Hint:
You will need the result that

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi c}} \exp\left(-\frac{1}{2c}x^2\right) dx = 1$$

for $c$ a complex number. Also, see Table 11.1.

14.18 (t) Prove that if $X_i \sim \exp(\lambda)$ and the $X_i$'s are IID, then $\sum_{i=1}^{N} X_i$ has an
Erlang PDF. Hint: See Table 11.1.

14.19 (o) (w,c) Find the mean and variance of the random variable

$$Y = \sum_{i=1}^{12} (U_i - 1/2)$$

where $U_i \sim \mathcal{U}(0, 1)$ and the $U_i$'s are IID. Estimate the PDF of $Y$ using a
computer  simulation  and  compare  it  to  a  standard  Gaussian  PDF.  See  Section 
15.5 for a theoretical justification of your results.

14.20 (w) Three different voltmeters measure the voltage of a 100 volt source. The
measurements can be modeled as random variables with

$$\begin{aligned} V_1 &\sim \mathcal{N}(100, 1) \\ V_2 &\sim \mathcal{N}(100, 10) \\ V_3 &\sim \mathcal{N}(100, 5). \end{aligned}$$

Is  it  better  to  average  the  results  or  just  use  the  most  accurate  voltmeter? 

14.21 (0) (f) If a $3 \times 1$ random vector has mean zero and covariance matrix

$$\begin{bmatrix} 3 & 2 & 1 \\ 2 & 3 & 2 \\ 1 & 2 & 3 \end{bmatrix}$$

find the optimal prediction of $X_3$ given that we have observed $X_1 = 1$ and
$X_2 = 2$.


14.22  (t)  Consider  the  prediction  of  the  random  variable  Y  based  on  observing  that 
$X = x$. Assuming $(X, Y)$ is a zero mean random vector, we propose using the
linear prediction $\hat{Y} = ax$. Determine the optimal value of $a$ (being the value
that  minimizes  the  MSE)  by  using  the  orthogonality  principle.  Explain  your 
results  by  drawing  a  diagram. 

14.23 (f) If a $3 \times 1$ random vector $\mathbf{X}$ has a zero mean and covariance matrix

$$\begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{bmatrix}$$

determine the optimal linear prediction of $X_3$ based on the observed outcomes
of $X_1$ and $X_2$. Why is $a_{1\mathrm{opt}} = 0$? Hint: Consider the covariance between
$\epsilon = X_3 - \rho X_2$, which is the predictor error for $X_3$ based on observing only $X_2$,
and $X_1$.


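14.24 (s^/) (t,f) Explain why the minimum MSE of the predictor $\hat{X}_3 = a_{1\mathrm{opt}} X_1 + a_{2\mathrm{opt}} X_2$ is

$$\begin{aligned} \mathrm{mse}_{\min} &= E_{X_1, X_2, X_3}\left[(X_3 - a_{1\mathrm{opt}} X_1 - a_{2\mathrm{opt}} X_2)^2\right] \\ &= E_{X_1, X_2, X_3}\left[(X_3 - a_{1\mathrm{opt}} X_1 - a_{2\mathrm{opt}} X_2) X_3\right] \\ &= c_{33} - a_{1\mathrm{opt}} c_{13} - a_{2\mathrm{opt}} c_{23}. \end{aligned}$$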


Next  use  this  result  to  find  the  minimum  MSE  for  Example  14.6. 


14.25 (^) (c) Use a computer simulation to generate realizations of the random
vector $\mathbf{X}$ described in Example 14.6. Then, predict $X_3$ based on the outcomes
of $X_1$ and $X_2$ and plot the true realizations and the predictions. Finally,
estimate  the  average  predictor  error  and  compare  your  results  to  the  theoretical 
minimum  MSE  obtained  in  Problem  14.24. 

14.26  (w)  For  the  signal  detection  example  described  in  Section  14.9  prove  that 
the  probability  of  saying  a  signal  is  present  when  indeed  there  is  one  goes  to 
1 as $A \to \infty$.

14.27 (c) Generate on a computer 1000 realizations of the two different random
variables $X_W \sim \mathcal{N}(0, 1)$ and $X_{s+W} \sim \mathcal{N}(0.5, 1)$. Next plot the outcomes of
the sample mean random variable versus $N$, the number of successive samples
averaged, or $\bar{x}_N = (1/N) \sum_{i=1}^{N} x_i$. What can you say about the sample means
as $N$ becomes large? Explain what this has to do with signal detection.


Chapter  15 


Probability  and  Moment 
Approximations  Using  Limit 
Theorems 

15.1  Introduction 

So  far  we  have  described  the  methods  for  determining  the  exact  probability  of  events 
using probability mass functions (PMFs) for discrete random variables and probability
density functions (PDFs) for continuous random variables. Also of importance
were the methods to determine the moments of these random variables. The procedures
employed were all based on knowledge of the PMF/PDF and the implementation
of its summation/integration. In many practical situations the PMF/PDF may
be  unknown  or  the  summation/integration  may  not  be  easily  carried  out.  It  would 
be  of  great  utility,  therefore,  to  be  able  to  approximate  the  desired  quantities  using 
much  simpler  methods.  For  random  variables  that  are  the  sum  of  a  large  number  of 
independent  and  identically  distributed  random  variables  this  can  be  done.  In  this 
chapter  we  focus  our  discussions  on  two  very  powerful  theorems  in  probability — the 
law  of  large  numbers  and  the  central  limit  theorem.  The  first  theorem  asserts  that 
the  sample  mean  random  variable,  which  is  the  average  of  IID  random  variables  and 
which was introduced in Chapter 14, converges to the expected value, a number, of
each random variable in the average. The law of large numbers is also known colloquially
as the law of averages. Another reason for its importance is that it provides
a  justification  for  the  relative  frequency  interpretation  of  probability.  The  second 
theorem  asserts  that  a  properly  normalized  sum  of  IID  random  variables  converges 
to  a  Gaussian  random  variable. 

The  theorems  are  actually  the  simplest  forms  of  much  more  general  results. 
For  example,  the  theorems  can  be  formulated  to  handle  sums  of  nonidentically 
distributed  random  variables  [Rao  1973]  and  dependent  random  variables  [Brockwell 
and  Davis  1987]. 




15.2  Summary 

The Bernoulli law of large numbers is introduced in Section 15.4 as a prelude to the
more  general  law  of  large  numbers.  The  latter  is  summarized  in  Theorem  15.4.1  and 
asserts  that  the  sample  mean  random  variable  of  IID  random  variables  will  converge 
to  the  expected  value  of  a  single  random  variable.  The  central  limit  theorem  is 
described  in  Section  15.5  where  it  is  demonstrated  that  the  repeated  convolution  of 
PDFs  produces  a  Gaussian  PDF.  For  continuous  random  variables  the  central  limit 
theorem,  which  asserts  that  the  sum  of  a  large  number  of  IID  random  variables  has  a 
Gaussian  PDF,  is  summarized  in  Theorem  15.5.1.  The  precise  statement  is  given  by 
(15.6).  For  the  sum  of  a  large  number  of  IID  discrete  random  variables  it  is  the  CDF 
that  converges  to  a  Gaussian  CDF.  Theorem  15.5.2  is  the  central  limit  theorem  for 
discrete  random  variables.  The  precise  statement  is  given  by  (15.9).  The  concept 
of  confidence  intervals  is  introduced  in  Section  15.6.  A  95%  confidence  interval  for 
the  sample  mean  estimate  of  the  parameter  p  of  a  Ber(p)  random  variable  is  given 
by  (15.14).  It  is  then  applied  to  the  real-world  problem  of  opinion  polling. 


15.3  Convergence  and  Approximation  of  a  Sum 


Since  we  will  be  dealing  with  the  sum  of  a  large  number  of  random  variables,  it  is 
worthwhile  first  to  review  some  concepts  of  convergence.  In  particular,  we  need  to 
understand  the  role  that  convergence  plays  in  approximating  the  behavior  of  a  sum 
of  terms.  As  an  illustrative  example,  consider  the  determination  of  the  value  of  the 
sum

$$s_N = \frac{1}{N} \sum_{i=1}^{N} (1 + a^i)$$

for some large value of $N$. We have purposely chosen a sum that may be evaluated
in  closed  form  to  allow  a  comparison  to  its  approximation.  The  exact  value  can  be 
found  as 


$$s_N = 1 + \frac{1}{N}\,\frac{a - a^{N+1}}{1 - a}.$$

Figure 15.1: Convergence of sum to 1.

Examples of $s_N$ versus $N$ are shown in Figure 15.1. The values of $s_N$ have been
connected by straight lines for easier viewing. It should be clear that as $N \to \infty$,
$s_N \to 1$ if $|a| < 1$. This means that if $N$ is sufficiently large, then $s_N$ will differ from 1
by a very small amount. This small amount, which is the error in the approximation
of $s_N$ by 1, is given by

$$\frac{1}{N}\,\frac{a - a^{N+1}}{1 - a}$$


and  will  depend  on  a  as  well  as  N.  For  example,  if  we  wish  to  claim  that  the  error 
is  less  than  0.1,  then  N  would  have  to  be  10  for  a  =  0.5  but  N  would  need  to  be  57 
for  a  =  0.85,  as  seen  in  Figure  15.1.  Thus,  in  general  the  error  of  the  approximation 
will  depend  upon  the  particular  sequence  (value  of  a  here).  We  can  assert,  without 
actually knowing the value of $a$ as long as $|a| < 1$ and hence the sum converges, that
$s_N$ will eventually become close to 1. The error can be quite large for a fixed value
of $N$ (consider what would happen if $a = 0.999$). Such are the advantages (sum will
be  close  to  1  for  all  |a|  <  1)  and  disadvantages  (how  large  does  N  have  to  be?)  of 
limit  theorems.  We  next  describe  the  law  of  large  numbers. 
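The values of N quoted above can be checked numerically. The following Python sketch assumes the sum has the form $s_N = 1 + (1/N)\sum_{i=1}^{N} a^i$, which is the form consistent with the quoted values N = 10 and N = 57; the function names are illustrative and are not taken from the text.

```python
def partial_sum(a, n):
    """s_N = 1 + (1/N) * sum_{i=1}^{N} a**i, accumulated term by term."""
    return 1.0 + sum(a ** i for i in range(1, n + 1)) / n


def approximation_error(a, n):
    """Closed-form error s_N - 1 = (a - a**(N+1)) / (N * (1 - a))."""
    return (a - a ** (n + 1)) / (n * (1.0 - a))


def smallest_n(a, tol):
    """Smallest N for which approximating s_N by 1 has error below tol."""
    n = 1
    while approximation_error(a, n) >= tol:
        n += 1
    return n
```

Calling `smallest_n(0.5, 0.1)` and `smallest_n(0.85, 0.1)` gives the two cases discussed in the text, and `smallest_n(0.999, 0.1)` shows how much larger N must be when $a$ is close to 1.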


15.4  Law  of  Large  Numbers 

When  we  began  our  study  of  probability,  we  argued  that  if  a  fair  coin  is  tossed  N 
times  in  succession,  then  the  relative  frequency  of  heads,  i.e.,  the  number  of  heads 
observed  divided  by  the  number  of  coin  tosses,  should  be  close  to  1/2.  This  was 
why we intuitively accepted the assignment of a probability of 1/2 to the event that
the outcome of a fair coin toss would be a head. If we continue to toss the coin,
then as $N \to \infty$, we expect the relative frequency to approach 1/2. We can now
prove  that  this  is  indeed  the  case  under  certain  assumptions.  First  we  model  the 
repeated  coin  toss  experiment  as  a  sequence  of  N  Bernoulli  subexperiments  (see 
also  Section  4.6.2).  The  result  of  the  ith  subexperiment  is  denoted  by  the  discrete 
random variable $X_i$, where

$$X_i = \begin{cases} 1 & \text{if heads} \\ 0 & \text{if tails.} \end{cases}$$

We can then model the overall experimental output by the random vector $\mathbf{X} = [X_1\ X_2\ \ldots\ X_N]^T$. We next assume that the discrete random variables $X_i$ are IID
with marginal PMF

$$p_X[k] = \frac{1}{2} \qquad k = 0, 1$$

or the experiment is a sequence of independent and identical Bernoulli subexperiments.
Finally, the relative frequency is given by the sample mean random variable

$$\bar{X}_N = \frac{1}{N} \sum_{i=1}^{N} X_i \qquad (15.1)$$


which  was  introduced  in  Chapter  14,  although  there  it  was  used  for  the  average  of 
continuous  random  variables.  We  subscript  the  sample  mean  random  variable  by  N 
to  remind  us  that  N  coin  toss  outcomes  are  used  in  its  computation.  Now  consider 
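what happens to the mean and variance of $\bar{X}_N$ as $N \to \infty$. The mean is

$$\begin{aligned} E_{\mathbf{X}}[\bar{X}_N] &= \frac{1}{N} \sum_{i=1}^{N} E_{\mathbf{X}}[X_i] \\ &= \frac{1}{N} \sum_{i=1}^{N} E_{X_i}[X_i] \\ &= \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} \\ &= \frac{1}{2} \quad \text{for all } N. \end{aligned}$$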


The  variance  is 


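$$\begin{aligned} \mathrm{var}(\bar{X}_N) &= \mathrm{var}\left(\frac{1}{N} \sum_{i=1}^{N} X_i\right) \\ &= \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{var}(X_i) \qquad (X_i\text{'s are independent} \Rightarrow \text{uncorrelated}) \\ &= \frac{\mathrm{var}(X_i)}{N} \qquad (X_i\text{'s are identically distributed} \Rightarrow \text{have same variance}). \end{aligned}$$

But for a Bernoulli random variable, $X_i \sim \mathrm{Ber}(p)$, the variance is $\mathrm{var}(X_i) = p(1 - p)$.
Since $p = 1/2$ for a fair coin,

$$\mathrm{var}(\bar{X}_N) = \frac{p(1 - p)}{N} = \frac{1}{4N} \to 0 \quad \text{as } N \to \infty.$$

Therefore the width of the PMF of $\bar{X}_N$ must decrease as $N$ increases and eventually
go to zero. Since the variance is defined as

$$\mathrm{var}(\bar{X}_N) = E_{\mathbf{X}}\left[(\bar{X}_N - E_{\mathbf{X}}[\bar{X}_N])^2\right]$$

we must have that as $N \to \infty$, $\bar{X}_N \to E_{\mathbf{X}}[\bar{X}_N] = 1/2$. In effect the random
variable $\bar{X}_N$ becomes not random at all but a constant. It is called a degenerate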
random  variable.  To  further  verify  that  the  PMF  becomes  concentrated  about  its 
mean,  which  is  1/2,  we  note  that  the  sum  of  N  IID  Bernoulli  random  variables  is  a 
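binomial random variable. Thus,

$$S_N = \sum_{i=1}^{N} X_i \sim \mathrm{bin}(N, 1/2)$$

and therefore the PMF is

$$p_{S_N}[k] = \binom{N}{k} \left(\frac{1}{2}\right)^N \qquad k = 0, 1, \ldots, N.$$

To find the PMF of $\bar{X}_N$ we let $\bar{X}_N = (1/N) \sum_{i=1}^{N} X_i = S_N/N$ and note that $\bar{X}_N$
can take on values $u_k = k/N$ for $k = 0, 1, \ldots, N$. Therefore, using the formula for
the transformation of a discrete random variable, the PMF becomes

$$p_{\bar{X}_N}[u_k] = \binom{N}{N u_k} \left(\frac{1}{2}\right)^N \qquad u_k = k/N;\ k = 0, 1, \ldots, N \qquad (15.2)$$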

which is plotted in Figure 15.2 for various values of $N$.

Figure 15.2: PMF for sample mean random variable of N IID Bernoulli random
variables with p = 1/2. It models the relative frequency of heads obtained for N
fair coin tosses. Panels: (a) N = 10, (b) N = 30, (c) N = 100.

Because as $N$ increases $\bar{X}_N$
takes on values more densely in the interval [0, 1], we do not obtain a PMF with
all  its  mass  concentrated  at  0.5,  as  we  might  expect.  Nonetheless,  the  probability 
that  the  sample  mean  random  variable  will  be  concentrated  about  1/2  increases. 
As  an  example,  the  probability  of  being  within  the  interval  [0.45, 0.55]  is  0.2461  for 
N  =  10,  0.4153  for  N  =  30,  and  0.7287  for  N  =  100,  as  can  be  verified  by  summing 
the  values  of  the  PMF  over  this  interval.  Usually  it  is  better  to  plot  the  CDF  since 
as $N \to \infty$, it can be shown to converge to the unit step beginning at $u = 0.5$
(see Problem 15.1). Also, it is interesting to note that the PMF appears Gaussian,
although it changes in amplitude and width for each $N$. This is an observation that
we will focus on later when we discuss the central limit theorem. The preceding
results say that for large enough $N$ the sample mean random variable will always
yield a number, which in this case is 1/2. By “always” we mean that every time we
perform a repeated Bernoulli experiment consisting of $N$ independent and fair coin
tosses, we will obtain a sample mean of 1/2, for $N$ large enough. As an example,
we have plotted in Figure 15.3 five realizations of the sample mean random variable
or $\bar{x}_N$ versus $N$. The values of $\bar{x}_N$ have been connected by straight lines for easier
viewing. We see that

$$\bar{X}_N \to \frac{1}{2} = E_X[X]. \qquad (15.3)$$

Figure 15.3: Realizations of sample mean random variable of N IID Bernoulli random
variables with p = 1/2 as N increases.

This is called the Bernoulli law of large numbers, and is known to the layman as
the law of averages. More generally for a Bernoulli subexperiment with probability
$p$, we have that

$$\bar{X}_N \to p = E_X[X].$$

The sample mean random variable converges to the expected value of a single random
variable. Note that since $\bar{X}_N$ is the relative frequency of heads and $p$ is the
probability of heads, we have shown that the probability of a head in a single coin
toss can be interpreted as the value obtained as the relative frequency of heads in a
large number of independent and identical coin tosses. This observation also justifies
our use of the sample mean random variable as an estimator of a moment since

$$\widehat{E_X[X]} = \frac{1}{N} \sum_{i=1}^{N} x_i \to E_X[X] \quad \text{as } N \to \infty$$

and more generally, justifies our use of $(1/N) \sum_{i=1}^{N} x_i^n$ as an estimate of the $n$th
moment $E[X^n]$ (see also Problem 15.6).
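The concentration of the PMF about 1/2 can be checked by summing (15.2) directly. The Python sketch below evaluates the probability that the sample mean of N fair coin tosses lies in [0.45, 0.55]; the function name and the use of integer percent bounds (to avoid floating-point comparisons of k/N) are choices made for this illustration.

```python
from math import comb


def prob_within(n, lo_pct=45, hi_pct=55):
    """P[lo <= sample mean <= hi] for N fair coin tosses.

    The bounds are integer percentages, so the condition
    lo <= k/N <= hi is tested exactly as lo*N <= 100*k <= hi*N.
    """
    favorable = sum(
        comb(n, k)
        for k in range(n + 1)
        if lo_pct * n <= 100 * k <= hi_pct * n
    )
    return favorable / 2 ** n
```

Evaluating `prob_within(10)`, `prob_within(30)`, and `prob_within(100)` should reproduce the three probabilities quoted earlier for the interval [0.45, 0.55].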

A  more  general  law  of  large  numbers  is  summarized  in  the  following  theorem.  It 
is  valid  for  the  sample  mean  of  IID  random  variables,  either  discrete,  continuous,  or 
mixed. 

Theorem 15.4.1 (Law of Large Numbers) If $X_1, X_2, \ldots, X_N$ are IID random
variables with mean $E_X[X]$ and $\mathrm{var}(X) = \sigma^2 < \infty$, then $\lim_{N \to \infty} \bar{X}_N = E_X[X]$.

Proof: 

Consider  the  probability  of  the  sample  mean  random  variable  deviating  from  the 
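expected value by more than $\epsilon$, where $\epsilon$ is a small positive number. This probability
is given by

$$P\left[|\bar{X}_N - E_X[X]| > \epsilon\right] = P\left[|\bar{X}_N - E_X[\bar{X}_N]| > \epsilon\right].$$

Since $\mathrm{var}(\bar{X}_N) = \sigma^2/N$, we have upon using Chebyshev’s inequality (see Section
11.8)

$$P\left[|\bar{X}_N - E_X[X]| > \epsilon\right] \le \frac{\mathrm{var}(\bar{X}_N)}{\epsilon^2} = \frac{\sigma^2}{N\epsilon^2}$$

and taking the limit of both sides yields

$$\lim_{N \to \infty} P\left[|\bar{X}_N - E_X[X]| > \epsilon\right] \le \lim_{N \to \infty} \frac{\sigma^2}{N\epsilon^2} = 0.$$

Since a probability must be greater than or equal to zero, we have finally that

$$\lim_{N \to \infty} P\left[|\bar{X}_N - E_X[X]| > \epsilon\right] = 0 \qquad (15.4)$$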

which  is  the  mathematical  statement  that  the  sample  mean  random  variable  con¬ 
verges  to  the  expected  value  of  a  single  random  variable. 

□ 

The limit in (15.4) says that for large enough $N$, the probability of the error
in the approximation of $\bar{X}_N$ by $E_X[X]$ exceeding $\epsilon$ (which can be chosen as small
as desired) will be exceedingly small. It is said that $\bar{X}_N \to E_X[X]$ in probability
[Grimmett and Stirzaker 2001].
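The Chebyshev bound used in the proof is typically loose, which a small simulation makes visible. The Python sketch below estimates the deviation probability for fair coin tosses and compares it with the bound $\sigma^2/(N\epsilon^2)$; the function names, the number of trials, and the seed are assumptions made for illustration.

```python
import random


def deviation_probability(n, eps, trials, rng):
    """Monte Carlo estimate of P[|sample mean - 1/2| > eps] for N fair tosses."""
    count = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        if abs(heads / n - 0.5) > eps:
            count += 1
    return count / trials


def chebyshev_bound(n, eps, var=0.25):
    """Upper bound var / (N * eps**2), capped at 1 (var = 1/4 for a fair coin)."""
    return min(1.0, var / (n * eps ** 2))
```

For example, with `rng = random.Random(0)`, comparing `deviation_probability(100, 0.1, 2000, rng)` against `chebyshev_bound(100, 0.1)` shows the estimate lying well below the bound, and both shrink as N grows.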




⚠ Convergence in probability does not mean all realizations will converge.

Referring to Figure 15.3 it is seen that for all realizations except the top one, the
error is small. The statement of (15.4) does allow some realizations to have an
error greater than $\epsilon$ for a given large $N$. However, the probability of this happening
becomes very small but not zero as $N$ increases. For all practical purposes, then, we
can ignore this occurrence. Hence, convergence in probability is somewhat different
than what one may be familiar with in dealing with convergence of deterministic
sequences. For deterministic sequences, all sequences (since there is only one) will
have an error less than $\epsilon$ for all $N > N_\epsilon$, where $N_\epsilon$ will depend on $\epsilon$ (see Figure 15.1).
The interested reader should consult [Grimmett and Stirzaker 2001] for further
details. See also Problem 15.8 for an example.

⚠

We  conclude  our  discussion  with  an  example  and  some  further  comments. 

Example  15.1  —  Sample  mean  for  IID  Gaussian  random  variables 

Recall  from  the  real-world  example  in  Chapter  14  that  when  a  signal  is  present  we 
have 

$$X_{s+W_i} \sim \mathcal{N}(A, \sigma^2) \qquad i = 1, 2, \ldots, N.$$

Since the random variables are IID, we have by the law of large numbers that

$$\bar{X}_N \to E_X[X] = A.$$

Thus, the upper curve shown in Figure 14.6 must approach $A = 0.5$ (with high
probability) as $N \to \infty$.

❖

In applying the law of large numbers we do not need to know the marginal PDF.
If in the previous example, we had $X_{s+W_i} \sim \mathcal{U}(0, 2A)$, then we also conclude that
$\bar{X}_N \to A$. As long as the random variables are IID with mean $A$ and a finite
variance, $\bar{X}_N \to A$ (although the error in the approximation will depend upon the
marginal PDF; see Problem 15.3).
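The point that the marginal PDF need not be known can be illustrated by simulation. The Python sketch below computes the running sample mean of one realization for any sampling function; the helper name and the choice A = 0.5 with unit-variance Gaussian samples are assumptions for this illustration.

```python
import random


def sample_mean_path(draw, n_max):
    """Running sample means for one realization of n_max IID draws.

    `draw` is any zero-argument function returning one outcome.
    Element n-1 of the result is the mean of the first n outcomes.
    """
    total = 0.0
    path = []
    for n in range(1, n_max + 1):
        total += draw()
        path.append(total / n)
    return path
```

With `rng = random.Random(1)` and `A = 0.5`, the calls `sample_mean_path(lambda: rng.gauss(A, 1.0), 20000)` and `sample_mean_path(lambda: rng.uniform(0.0, 2 * A), 20000)` both end near A, although the two paths approach it at different rates because the variances differ.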


15.5  Central  Limit  Theorem 

By the law of large numbers the PMF/PDF of the sample mean random variable
decreases in width until all the probability is concentrated about the mean. The
theorem, however, does not say much about the PMF/PDF itself. However, by considering
a slightly modified sample mean random variable, we can make some more
definitive assertions about its probability distribution. To illustrate the necessity
of doing so we consider the PDF of a continuous random variable that is the sum
of $N$ continuous IID random variables. A particularly illustrative example is for
$X_i \sim \mathcal{U}(-1/2, 1/2)$.


Example 15.2 - PDF for sum of IID $\mathcal{U}(-1/2, 1/2)$ random variables

Consider the sum

$$S_N = \sum_{i=1}^{N} X_i$$

where the $X_i$'s are IID random variables with $X_i \sim \mathcal{U}(-1/2, 1/2)$. If $N = 2$, then
$S_2 = X_1 + X_2$ and the PDF of $S_2$ is easily found using a convolution integral as
described in Section 12.6. Therefore,

$$p_{S_2}(x) = p_X(x) * p_X(x) = \int_{-\infty}^{\infty} p_X(u) p_X(x - u)\, du$$

where  *  denotes  convolution.  The  evaluation  of  the  convolution  integral  is  most 
easily done by plotting $p_X(u)$ and $p_X(x - u)$ versus $u$ as shown in Figure 15.4a.
This is necessary to determine the regions over which the product of $p_X(u)$ and
$p_X(x - u)$ is nonzero and so contributes to the integral. The reader should be able
to  show,  based  upon  Figure  15.4a,  that  the  PDF  of  S2  is  that  shown  in  Figure  15.4b. 
More  generally,  we  have  from  (14.22)  that 


$$p_{S_N}(x) = \int_{-\infty}^{\infty} \phi_X^N(\omega) \exp(-j\omega x)\, \frac{d\omega}{2\pi} = \underbrace{p_X(x) * p_X(x) * \cdots * p_X(x)}_{(N-1)\ \text{convolutions}}.$$

Figure 15.4: Determining the PDF for the sum of two independent uniform random
variables using a convolution integral evaluation. Panels: (a) cross-hatched region
contributes to integral, (b) result of convolution.


Hence to find $p_{S_3}(x)$ we must convolve $p_{S_2}(x)$ with $p_X(x)$ to yield $p_X(x) * p_X(x) * p_X(x)$
since $p_{S_2}(x) = p_X(x) * p_X(x)$. This is

$$p_{S_3}(x) = \int_{-\infty}^{\infty} p_{S_2}(u) p_X(x - u)\, du$$

but since $p_X(-x) = p_X(x)$, we can express this in the more convenient form as

$$p_{S_3}(x) = \int_{-\infty}^{\infty} p_{S_2}(u) p_X(u - x)\, du.$$

The integrand may be determined by plotting $p_{S_2}(u)$ and the right-shifted version
$p_X(u - x)$ and multiplying these two functions. The different regions that must be
considered are shown in Figure 15.5.

Figure 15.5: Determination of limits for convolution integral. Panels: (a) $-3/2 \le x \le -1/2$, (b) $-1/2 < x \le 1/2$, (c) $1/2 < x \le 3/2$.

Hence, referring to Figure 15.5 we have

$$p_{S_3}(x) = \begin{cases} \frac{1}{2}\left(x + \frac{3}{2}\right)^2 & -3/2 \le x \le -1/2 \\ \frac{3}{4} - x^2 & -1/2 < x \le 1/2 \\ \frac{1}{2}\left(x - \frac{3}{2}\right)^2 & 1/2 < x \le 3/2 \end{cases}$$

and $p_{S_3}(x) = 0$ otherwise. This is plotted in Figure 15.6 versus the PDF of a
$\mathcal{N}(0, 3/12)$ random variable. Note the close agreement. We have chosen the mean
and variance of the Gaussian approximation to match that of $p_{S_3}(x)$ (recall that
$\mathrm{var}(X) = (b - a)^2/12$ for $X \sim \mathcal{U}(a, b)$ and hence $\mathrm{var}(X_i) = 1/12$). If we continue
the convolution process, the mean will remain at zero but the variance of $S_N$ will
be $N/12$.

❖
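The closeness of the Gaussian approximation in Example 15.2 can be quantified with a short numerical check. The Python sketch below codes the known piecewise-quadratic PDF of the sum of three IID U(-1/2, 1/2) random variables ($3/4 - x^2$ for $|x| \le 1/2$ and $\frac{1}{2}(3/2 - |x|)^2$ for $1/2 < |x| \le 3/2$) and compares it with the $\mathcal{N}(0, 3/12)$ PDF. It is an independent illustration, not the program of Appendix 15A, and the helper names are hypothetical.

```python
import math


def p_s3(x):
    """PDF of the sum of three IID U(-1/2, 1/2) random variables."""
    ax = abs(x)
    if ax <= 0.5:
        return 0.75 - x * x
    if ax <= 1.5:
        return 0.5 * (1.5 - ax) ** 2
    return 0.0


def gaussian_pdf(x, mean, var):
    """PDF of a Gaussian random variable with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def midpoint_integral(f, lo, hi, steps=30000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


def max_abs_difference(points=3001):
    """Largest |p_S3(x) - Gaussian PDF| on a uniform grid over [-1.5, 1.5]."""
    grid = (-1.5 + 3.0 * i / (points - 1) for i in range(points))
    return max(abs(p_s3(x) - gaussian_pdf(x, 0.0, 0.25)) for x in grid)
```

`midpoint_integral(p_s3, -1.5, 1.5)` should be close to 1 and `midpoint_integral(lambda x: x * x * p_s3(x), -1.5, 1.5)` close to 3/12, while `max_abs_difference()` gives the largest pointwise gap between the two curves.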

Figure 15.6: PDF for sum of 3 IID $\mathcal{U}(-1/2, 1/2)$ random variables and Gaussian
approximation.

A MATLAB program that implements a repeated convolution for a PDF that is
nonzero over the interval (0, 1) is given in Appendix 15A. It can be used to verify
analytical results and also to try out other PDFs. An example of its use is shown in
Figure 15.7 for the repeated convolution of a $\mathcal{U}(0, 1)$ PDF. Note that as $N$ increases
the PDF moves to the right since $E[S_N] = N E_X[X] = N/2$ and the variance also
increases since $\mathrm{var}(S_N) = N \mathrm{var}(X) = N/12$. Because of this behavior it is not
possible to state that the PDF converges to any PDF. To circumvent this problem
it is necessary to normalize the sum so that its mean and variance are fixed as $N$
increases. It is convenient, therefore, to have the mean fixed at 0 and the variance
fixed at 1, resulting in a standardized sum. Recall from Section 7.9 that this is easily
accomplished by forming

$$\frac{S_N - E[S_N]}{\sqrt{\mathrm{var}(S_N)}} = \frac{S_N - N E_X[X]}{\sqrt{N \mathrm{var}(X)}}. \qquad (15.5)$$


By doing so, we can now assert that this standardized random variable will converge
to a $\mathcal{N}(0, 1)$ random variable. An example is shown in Figure 15.8 for $X_i \sim \mathcal{U}(0, 1)$
and for $N = 2, 3, 4$. This is the famous central limit theorem, which says that the
PDF  of  the  standardized  sum  of  a  large  number  of  continuous  IID  random  variables 
will  converge  to  a  Gaussian  PDF.  Its  great  importance  is  that  in  many  practical 
situations  one  can  model  a  random  variable  as  having  arisen  from  the  contributions 
of  many  small  and  similar  physical  effects.  By  making  the  IID  assumption  we  can 
assert  that  the  PDF  is  Gaussian.  There  is  no  need  to  know  the  PDF  of  each  random 
variable  or  even  if  it  is  known,  to  determine  the  exact  PDF  of  the  sum,  which  may 
not be possible. Some application areas are:

1. Polling (see Section 15.6) [Weisburg, Krosnick, Bowen 1996]

2. Noise characterization [Middleton 1960]

3. Scattering effects modeling [Urick 1975]

4. Kinetic theory of gases [Reif 1965]

5. Economic modeling [Harvey 1989]

and many more.

Figure 15.7: PDF of sum of N IID $\mathcal{U}(0, 1)$ random variables. The plots were obtained
using clt_demo.m listed in Appendix 15A. Panels: (a) $p_X$, (b) $p_{S_4}$, (c) $p_{S_8}$, (d) $p_{S_{12}}$.

Figure 15.8: PDF of standardized sum of N IID $\mathcal{U}(0, 1)$ random variables.
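The effect of the standardization in (15.5) can be seen in a small simulation. The Python sketch below draws realizations of the standardized sum of N IID U(0, 1) random variables and compares the fraction falling below a point with the standard Gaussian CDF. It is a Monte Carlo illustration rather than the repeated-convolution approach of Appendix 15A, and the names and sample sizes are assumptions.

```python
import math
import random


def standardized_sum(n, rng):
    """One realization of (S_N - N*E[X]) / sqrt(N*var(X)) for X ~ U(0, 1)."""
    s = sum(rng.random() for _ in range(n))
    return (s - n * 0.5) / math.sqrt(n / 12.0)


def gaussian_cdf(x):
    """CDF of a standard Gaussian random variable."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def empirical_cdf(samples, x):
    """Fraction of samples less than or equal to x."""
    return sum(1 for z in samples if z <= x) / len(samples)
```

For instance, with `rng = random.Random(2)` and `z = [standardized_sum(12, rng) for _ in range(20000)]`, the value of `empirical_cdf(z, 1.0)` can be compared with `gaussian_cdf(1.0)`.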

We  now  state  the  theorem  for  continuous  random  variables. 

Theorem  15.5.1  (Central  limit  theorem  for  continuous  random  variables) 

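If $X_1, X_2, \ldots, X_N$ are continuous IID random variables, each with mean $E_X[X]$ and
variance $\mathrm{var}(X)$, and $S_N = \sum_{i=1}^{N} X_i$, then as $N \to \infty$

$$\frac{S_N - E[S_N]}{\sqrt{\mathrm{var}(S_N)}} = \frac{\sum_{i=1}^{N} X_i - N E_X[X]}{\sqrt{N \mathrm{var}(X)}} \to \mathcal{N}(0, 1). \qquad (15.6)$$

Equivalently, the CDF of the standardized sum converges to $\Phi(x)$ or

$$P\left[\frac{S_N - E[S_N]}{\sqrt{\mathrm{var}(S_N)}} \le x\right] \to \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}t^2\right) dt = \Phi(x). \qquad (15.7)$$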

The  proof  is  given  in  Appendix  15B  and  is  based  on  the  properties  of  characteristic 
functions  and  the  continuity  theorem.  An  example  follows. 

Example 15.3 - PDF of sum of squares of independent $\mathcal{N}(0, 1)$ random
variables

Let $X_i \sim \mathcal{N}(0, 1)$ for $i = 1, 2, \ldots, N$ and assume that the $X_i$'s are independent.
We wish to determine the approximate PDF of $Y_N = \sum_{i=1}^{N} X_i^2$ as $N$ becomes large.
Note that the exact PDF for $Y_N$ is a $\chi_N^2$ PDF so that we will equivalently find
an approximation to the PDF of the standardized $\chi_N^2$ random variable. To apply
the central limit theorem we first note that since the $X_i$'s are IID so are the $X_i^2$'s
(why?). Then as $N \to \infty$ we have from (15.6)

$$\frac{\sum_{i=1}^{N} X_i^2 - N E_X[X^2]}{\sqrt{N \mathrm{var}(X^2)}} \to \mathcal{N}(0, 1).$$

But $X^2 \sim \chi_1^2$ so that $E_X[X^2] = 1$ and $\mathrm{var}(X^2) = 2$ (see Section 10.5.6 and Table
11.1 for a $\chi_N^2 = \Gamma(N/2, 1/2)$ PDF) and therefore

$$\frac{\sum_{i=1}^{N} X_i^2 - N}{\sqrt{2N}} \to \mathcal{N}(0, 1).$$

Noting that for finite $N$ this result can be viewed as an approximation, we can use
the approximate result

$$Y_N = \sum_{i=1}^{N} X_i^2 \sim \mathcal{N}(N, 2N)$$

in making probability calculations. The error in the approximation is shown in
Figure 15.9, where the approximate PDF (shown as the solid curve) of $Y_N$, which
is a $\mathcal{N}(N, 2N)$, is compared to the exact PDF, which is a $\chi_N^2$ (shown as the dashed
curve). It is seen that the approximation becomes better as $N$ increases.


Figure 15.9: $\chi_N^2$ PDF (dashed curve) and Gaussian PDF approximation of
$\mathcal{N}(N, 2N)$ (solid curve). Panels: (a) N = 10, (b) N = 40.

❖
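The improvement of the Gaussian approximation in Example 15.3 with increasing N can be measured directly from the two PDFs. The Python sketch below evaluates the exact $\chi_N^2$ PDF (through its logarithm, to avoid overflow) and the $\mathcal{N}(N, 2N)$ PDF on a grid and reports the largest difference; the grid width of six standard deviations and the function names are assumptions for this illustration.

```python
import math


def chi2_pdf(y, n):
    """Exact chi-squared PDF with n degrees of freedom."""
    if y <= 0.0:
        return 0.0
    log_p = ((n / 2 - 1) * math.log(y) - y / 2
             - (n / 2) * math.log(2.0) - math.lgamma(n / 2))
    return math.exp(log_p)


def gaussian_pdf(y, mean, var):
    """PDF of a Gaussian random variable with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def max_pdf_difference(n, points=4001):
    """Largest |chi2_N PDF - N(N, 2N) PDF| within 6 standard deviations of N."""
    half_width = 6.0 * math.sqrt(2.0 * n)
    step = 2.0 * half_width / (points - 1)
    grid = (n - half_width + i * step for i in range(points))
    return max(abs(chi2_pdf(y, n) - gaussian_pdf(y, n, 2.0 * n)) for y in grid)
```

Comparing `max_pdf_difference(10)` with `max_pdf_difference(40)` gives a single number for each panel of Figure 15.9.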

For  the  previous  example  it  can  be  shown  directly  that  the  characteristic  function  of 
the standardized $\chi_N^2$ random variable converges to that of the standardized Gaussian
random  variable,  and  hence  so  do  their  PDFs  by  the  continuity  theorem  (see  Section 
11.7  for  third  property  of  characteristic  function  and  also  Problem  15.17).  We  next 
give  an  example  that  quantifies  the  numerical  error  of  the  central  limit  theorem 
approximation.

Example 15.4 - Central limit theorem and computation of probabilities - numerical results

Recall that the Erlang PDF is the PDF of the sum of $N$ IID exponential random
variables, where $X_i \sim \exp(\lambda)$ for $i = 1, 2, \ldots, N$ (see Section 10.5.6). Hence, letting
$Y_N = \sum_{i=1}^{N} X_i$ the Erlang PDF is

$$p_{Y_N}(y) = \begin{cases} \frac{\lambda^N}{(N-1)!}\, y^{N-1} \exp(-\lambda y) & y \ge 0 \\ 0 & y < 0. \end{cases} \qquad (15.8)$$

Its mean is $N/\lambda$ and its variance is $N/\lambda^2$ since the mean and variance of an $\exp(\lambda)$
random variable are $1/\lambda$ and $1/\lambda^2$, respectively. If we wish to determine $P[Y_N > 10]$,
then from (15.8) we can find the exact value for $\lambda = 1$ as

$$P[Y_N > 10] = \int_{10}^{\infty} \frac{1}{(N-1)!}\, y^{N-1} \exp(-y)\, dy.$$

But using

$$\int y^n \exp(-y)\, dy = -n! \exp(-y) \sum_{k=0}^{n} \frac{y^k}{k!}$$

[Gradshteyn and Ryzhik 1994], we have

$$P[Y_N > 10] = \frac{1}{(N-1)!} \left[ -(N-1)! \exp(-y) \sum_{k=0}^{N-1} \frac{y^k}{k!} \right]_{10}^{\infty} = \exp(-10) \sum_{k=0}^{N-1} \frac{10^k}{k!}.$$

A central limit theorem approximation would yield $Y_N \sim \mathcal{N}(N/\lambda, N/\lambda^2) = \mathcal{N}(N, N)$
so that

$$\hat{P}[Y_N > 10] = Q\left(\frac{10 - N}{\sqrt{N}}\right)$$

where the $\hat{P}$ denotes the approximation of $P$. The true and approximate values
for this probability are shown in Figure 15.10. The probability values have been
connected by straight lines for easier viewing.

❖ 
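The comparison in Example 15.4 can be reproduced with a few lines of code. The Python sketch below evaluates the exact tail probability $\exp(-10)\sum_{k=0}^{N-1} 10^k/k!$ and the Gaussian approximation $Q((10 - N)/\sqrt{N})$ for $\lambda = 1$; the function names are illustrative, and the $Q$ function is computed from the complementary error function.

```python
import math


def exact_tail(n, y=10.0):
    """P[Y_N > y] for an Erlang Y_N with lambda = 1.

    Uses exp(-y) * sum_{k=0}^{N-1} y**k / k!, building each term from the last.
    """
    term = 1.0
    total = 0.0
    for k in range(n):
        total += term
        term *= y / (k + 1)
    return math.exp(-y) * total


def q_function(x):
    """Right-tail probability of a standard Gaussian random variable."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def clt_tail(n, y=10.0):
    """Central limit theorem approximation Q((y - N) / sqrt(N))."""
    return q_function((y - n) / math.sqrt(n))
```

Tabulating `exact_tail(n)` and `clt_tail(n)` for a range of `n` gives the two curves of the kind shown in Figure 15.10.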

Figure 15.10: Exact and approximate calculation of probability that $Y_N > 10$ for $Y_N$
an Erlang PDF. Exact value shown as dashed curve and Gaussian approximation
as solid curve.

For the sum of IID discrete random variables the situation changes markedly. Consider
the sum of $N$ IID $\mathrm{Ber}(p)$ random variables. We already know that the PMF
is binomial so that

$$p_{S_N}[k] = \binom{N}{k} p^k (1 - p)^{N-k} \qquad k = 0, 1, \ldots, N.$$

Hence, this example will allow us to compare the true PMF against any approximation.
For reasons already explained we need to consider the PMF of the standardized
sum or

$$\frac{S_N - E[S_N]}{\sqrt{\mathrm{var}(S_N)}} = \frac{S_N - Np}{\sqrt{Np(1 - p)}}.$$
The  PMF  of  the  standardized  binomial  random  variable  PMF  withp  —  1/2  is  shown 
in  Figure  15.11  for  various  values  on  N.  Note  that  it  does  not  converge  to  any  given 


(a)  N  =  10  (b)  N  =  30  (c)  N  =  100 


Figure  15.11:  PMF  for  standardized  binomial  random  variable  with  p  =  1/2. 
PMF,  although  the  “envelope”,  whose  amplitude  decreases  as  N  increases,  appears 
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to  be  Gaussian.  The  lack  of  convergence  is  because  the  sample  space  or  values  that 
the  standardized  random  variable  can  take  on  changes  with  N.  The  possible  values 
are 


\[
x_k = \frac{k - Np}{\sqrt{Np(1-p)}} = \frac{k - N/2}{\sqrt{N/4}} \qquad k = 0, 1, \ldots, N
\]


which become more dense as $N$ increases. However, what does converge is the CDF
as shown in Figure 15.12.

Figure 15.12: CDF for standardized binomial random variable with $p = 1/2$. Panels: (a) $N = 10$, (b) $N = 30$, (c) $N = 100$.

Now as $N \to \infty$ we can assert that the CDF converges,
and furthermore it converges to the CDF of a $\mathcal{N}(0, 1)$ random variable. Hence, the
central limit theorem for discrete random variables is stated in terms of its CDF. It
says that as $N \to \infty$


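\[
P\left[ \frac{S_N - E[S_N]}{\sqrt{\mathrm{var}(S_N)}} \le x \right] \to \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} t^2 \right) dt = \Phi(x)
\]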


and is also known as the DeMoivre-Laplace theorem. We summarize the central
limit  theorem  for  discrete  random  variables  next. 


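Theorem 15.5.2 (Central limit theorem for discrete random variables)

If $X_1, X_2, \ldots, X_N$ are IID discrete random variables, each with mean $E_X[X]$ and
variance $\mathrm{var}(X)$, and $S_N = \sum_{i=1}^{N} X_i$, then as $N \to \infty$

\[
P\left[ \frac{S_N - E[S_N]}{\sqrt{\mathrm{var}(S_N)}} \le x \right] \to \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} t^2 \right) dt = \Phi(x). \tag{15.9}
\]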
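The CDF convergence asserted by the theorem can be checked numerically for the binomial case. The sketch below is an illustration in Python, not part of the text; the helper names `phi` and `max_cdf_gap` are assumptions. It measures the largest gap between the CDF of the standardized $\mathrm{bin}(N, p)$ random variable and $\Phi(x)$ at the jump points $x_k$.

```python
import math


def phi(x):
    """CDF of a standard Gaussian random variable."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def max_cdf_gap(n, p=0.5):
    """Largest |F_N(x) - Phi(x)| at the jump points x_k of the standardized binomial CDF.

    Both the value just before each jump and the value at the jump are checked,
    since the gap to a continuous CDF is largest at one of those two points.
    """
    mean = n * p
    std = math.sqrt(n * p * (1.0 - p))
    cdf = 0.0
    gap = 0.0
    for k in range(n + 1):
        x_k = (k - mean) / std
        gap = max(gap, abs(cdf - phi(x_k)))   # just before the jump at x_k
        cdf += math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        gap = max(gap, abs(cdf - phi(x_k)))   # at the jump
    return gap


if __name__ == "__main__":
    for n in (10, 30, 100):
        print(n, round(max_cdf_gap(n), 4))
```

For $p = 1/2$ the gap shrinks as $N$ grows, which is the behavior the figures for $N = 10, 30, 100$ are described as showing; the PMF itself has no such limit because its support changes with $N$.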


An  example  follows. 

Example  15.5  -  Computation  of  binomial  probability 

Assume that $Y_N \sim \mathrm{bin}(N, 1/2)$, which may be viewed as the PMF for the number of
heads obtained in $N$ fair coin tosses, and consider the probability $P[k_1 \le Y_N \le k_2]$.
Then the exact probability is

\[
P[k_1 \le Y_N \le k_2] = \sum_{k=k_1}^{k_2} \binom{N}{k} \left( \frac{1}{2} \right)^N. \tag{15.10}
\]


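A central limit theorem approximation yields

\[
\begin{aligned}
P[k_1 \le Y_N \le k_2] &= P\left[ \frac{k_1 - N/2}{\sqrt{N/4}} \le \frac{Y_N - N/2}{\sqrt{N/4}} \le \frac{k_2 - N/2}{\sqrt{N/4}} \right] \\
&\approx \Phi\left( \frac{k_2 - N/2}{\sqrt{N/4}} \right) - \Phi\left( \frac{k_1 - N/2}{\sqrt{N/4}} \right) \qquad \text{(from (10.25))}
\end{aligned}
\]

since

\[
\frac{Y_N - Np}{\sqrt{Np(1-p)}} = \frac{Y_N - N/2}{\sqrt{N/4}}
\]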

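is the standardized random variable for $p = 1/2$. For example, if we wish to compute
the probability of between 490 and 510 heads out of $N = 1000$ tosses, then

\[
\begin{aligned}
P[490 \le Y_N \le 510] &\approx \Phi\left( \frac{510 - 500}{\sqrt{250}} \right) - \Phi\left( \frac{490 - 500}{\sqrt{250}} \right) \\
&= 1 - Q\left( \frac{10}{\sqrt{250}} \right) - \left[ 1 - Q\left( \frac{-10}{\sqrt{250}} \right) \right] \\
&= 1 - 2Q\left( \frac{10}{\sqrt{250}} \right) = 0.4729.
\end{aligned}
\]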


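The exact value, however, is from (15.10)

\[
P[490 \le Y_N \le 510] = \sum_{k=490}^{510} \binom{1000}{k} \left( \frac{1}{2} \right)^{1000} = 0.4933 \tag{15.11}
\]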


(see Problem 15.24 on how this was computed). A slightly better approximation
using the central limit theorem can be obtained by replacing $P[490 \le Y_N \le 510]$ with
$P[489.5 \le Y_N \le 510.5]$, which will more closely approximate the discrete random
variable CDF by the continuous Gaussian CDF. This is because the binomial CDF
has jumps at the integers as can be seen by referring to Figure 15.12. By taking a
slightly larger interval to be used with the Gaussian approximation, the area under
the Gaussian PDF more closely approximates these jumps at the endpoints of the
interval. With this approximation we have

\[
P[489.5 \le Y_N \le 510.5] \approx Q\left( \frac{489.5 - 500}{\sqrt{250}} \right) - Q\left( \frac{510.5 - 500}{\sqrt{250}} \right) = 0.4934
\]


which  is  quite  close  to  the  true  value! 


❖
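The three numbers in this example can be recomputed with a short script. The following Python sketch is illustrative only; the function names are hypothetical, and the exact sum is evaluated through logarithms in the manner outlined in Problem 15.24 so that no term overflows or underflows.

```python
import math


def phi(x):
    """CDF of a standard Gaussian random variable."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def log_pmf_fair(n, k):
    """ln of C(n, k) (1/2)^n, built from sums of logarithms."""
    log_num = sum(math.log(i) for i in range(n - k + 1, n + 1))
    log_den = sum(math.log(i) for i in range(1, k + 1))
    return log_num - log_den - n * math.log(2.0)


def exact_probability(n, k1, k2):
    """P[k1 <= Y_N <= k2] for Y_N ~ bin(n, 1/2), summing exponentiated log terms."""
    return sum(math.exp(log_pmf_fair(n, k)) for k in range(k1, k2 + 1))


def clt_probability(n, low, high):
    """Gaussian approximation Phi((high - n/2)/sqrt(n/4)) - Phi((low - n/2)/sqrt(n/4))."""
    mean = n / 2.0
    std = math.sqrt(n / 4.0)
    return phi((high - mean) / std) - phi((low - mean) / std)


exact = exact_probability(1000, 490, 510)
plain = clt_probability(1000, 490, 510)
corrected = clt_probability(1000, 489.5, 510.5)   # endpoints widened by 1/2

if __name__ == "__main__":
    print(round(exact, 4), round(plain, 4), round(corrected, 4))
```

Widening each endpoint by one half is the only difference between the last two calls, which makes the effect of the correction easy to isolate.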




15.6  Real-World  Example  -  Opinion  Polling 

A  frequent  news  topic  of  interest  is  the  opinion  of  people  on  a  major  issue.  For 
example,  during  the  year  of  a  presidential  election  in  the  United  States,  we  hear 
almost  on  a  daily  basis  the  percentage  of  people  who  would  vote  for  candidate  A, 
with  the  remaining  percentage  voting  for  candidate  B.  It  may  be  reported  that  75% 
of  the  population  would  vote  for  candidate  A  and  25%  would  vote  for  candidate 
B.  Upon  reflection,  it  does  not  seem  reasonable  that  a  news  organization  would 
contact  the  entire  population  of  the  United  States,  almost  294,000,000  people,  to 
determine  their  voter  preferences.  And  indeed  it  is  unreasonable!  A  more  typical 
number  of  people  contacted  is  only  about  1000.  How  then  can  the  news  organization 
report  that  75%  of  the  population  would  vote  for  candidate  A?  The  answer  lies  in 
the  polling  error  -  the  results  are  actually  stated  as  75%  with  a  margin  of  error 
of  ±3%.  Hence,  it  is  not  claimed  that  exactly  75%  of  the  population  would  vote 
for  candidate  A,  but  between  72%  and  78%  would  vote  for  candidate  A.  Even  so, 
this  seems  like  a  lot  of  information  to  be  gleaned  from  a  very  small  sample  of  the 
population. 

An  analogous  problem  may  help  to  unravel  the  mystery.  Let’s  say  we  have  a 
coin  with  an  unknown  probability  of  heads  p.  We  wish  to  estimate  p  by  tossing  the 
coin  N  times.  As  we  have  already  discussed,  the  law  of  large  numbers  asserts  that 
we  can  determine  p  without  error  if  we  toss  the  coin  an  infinite  number  of  times 
and  use  as  our  estimate  the  relative  frequency  of  heads.  However,  in  practice  we  are 
limited  to  only  N  coin  tosses.  How  much  will  our  estimate  be  in  error?  Or  more 
precisely,  how  much  can  the  true  value  deviate  from  our  estimate?  We  know  that 
the  number  of  heads  observed  in  N  independent  coin  tosses  can  be  anywhere  from 
0 to $N$. Hence, our estimate of $p$ for $N = 1000$ can take on the possible values

\[
\hat{p} = 0, \frac{1}{1000}, \frac{2}{1000}, \ldots, 1.
\]

Of course, most of these estimates are not very probable. The probability that the
estimate will take on these values is

\[
P[\hat{p} = k/1000] = \binom{1000}{k} p^k (1-p)^{1000-k} \qquad k = 0, 1, \ldots, 1000
\]

which is shown in Figure 15.13 for $p = 0.75$. The probabilities for $\hat{p}$ outside the
interval shown are approximately zero. Note that the maximum probability is for
the true value $p = 0.75$. To assess the error in the estimate of $p$ we can determine the
interval over which say 95% of the $\hat{p}$'s will lie. The interval is chosen to be centered
about $p = 0.75$. In Figure 15.13 it is shown as the interval contained within the
dashed vertical lines and is found by solving

\[
\sum_{k=k_1}^{k_2} \underbrace{\binom{1000}{k} (0.75)^k (0.25)^{1000-k}}_{P[k \text{ heads}]} = 0.95 \tag{15.12}
\]

yielding $k_1 = 724$ and $k_2 = 776$, which results in $\hat{p}_1 = 0.724$ and $\hat{p}_2 = 0.776$.

Figure 15.13: PMF for estimate of $p$ for a binomial random variable. Also shown
as the dashed vertical lines are the boundaries of the interval within which 95% of
the estimates will lie.

Hence, for $p = 0.75$ we see that 95% of the time (if we kept repeating the 1000 coin toss
experiment), the value of $\hat{p}$ would be in the interval $[0.724, 0.776]$. We can assert
that we are 95% confident that for $p = 0.75$

\[
p - 0.026 \le \hat{p} \le p + 0.026
\]

or

\[
-p + 0.026 \ge -\hat{p} \ge -p - 0.026
\]

or finally

\[
\hat{p} - 0.026 \le p \le \hat{p} + 0.026.
\]

The interval $[\hat{p} - 0.026, \hat{p} + 0.026]$ is called the 95% confidence interval. It is a random
interval that covers the true value of $p = 0.75$ for 95% of the time. As an example
a MATLAB simulation is shown in Figure 15.14. For each of 50 trials the estimate
of $p$ is shown by the dot while the confidence interval is indicated by a vertical line.
Note that only 3 of the intervals fail to cover the true value of $p = 0.75$. With 50
trials and a probability of 0.95 we expect 2.5 intervals not to cover the true value.
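Both the left side of (15.12) and the repeated-experiment interpretation can be explored numerically. The Python sketch below is an illustration under stated assumptions: the helper names, the random seed, and the use of Python's `random` module are not from the text, which reports a MATLAB simulation whose code is not shown.

```python
import math
import random


def log_binomial_pmf(n, k, p):
    """ln of C(n, k) p^k (1-p)^(n-k), using log-gamma to avoid overflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def interval_probability(n, p, k1, k2):
    """P[k1 <= number of heads <= k2], the left side of (15.12)."""
    return sum(math.exp(log_binomial_pmf(n, k, p)) for k in range(k1, k2 + 1))


def count_misses(trials=50, n=1000, p=0.75, k1=724, k2=776, seed=0):
    """Repeat the n-toss experiment and count trials whose head count falls outside [k1, k2].

    A head count outside [k1, k2] is the same event as the interval
    [p_hat - 0.026, p_hat + 0.026] failing to cover p = 0.75.
    """
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        heads = sum(rng.random() < p for _ in range(n))
        if heads < k1 or heads > k2:
            misses += 1
    return misses


coverage = interval_probability(1000, 0.75, 724, 776)

if __name__ == "__main__":
    print("P[724 <= k <= 776] =", round(coverage, 4))
    print("misses in 50 trials:", count_misses())
```

The miss count varies with the seed; over many repetitions it should average close to the 2.5 out of 50 quoted above.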

Instead of having to compute $k_1$ and $k_2$ using (15.12), it is easier in practice to
use the central limit theorem. Since $\hat{p} = (1/N) \sum_{i=1}^{N} X_i$ with $X_i \sim \mathrm{Ber}(p)$, is a sum
of IID random variables we can assert from Theorem 15.5.2 that

\[
P\left[ -b \le \frac{\hat{p} - E[\hat{p}]}{\sqrt{\mathrm{var}(\hat{p})}} \le b \right] \approx \Phi(b) - \Phi(-b).
\]


Figure 15.14: 95% confidence interval for estimate of $p = 0.75$ for a binomial random
variable. The estimates are shown as dots.


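Noting that $X_i \sim \mathrm{Ber}(p)$, $E[\hat{p}] = E[\sum_{i=1}^{N} X_i / N] = Np/N = p$ and $\mathrm{var}(\hat{p}) =
\mathrm{var}(\sum_{i=1}^{N} X_i / N) = Np(1-p)/N^2 = p(1-p)/N$, we have

\[
P\left[ -b \le \frac{\hat{p} - p}{\sqrt{p(1-p)/N}} \le b \right] \approx \Phi(b) - \Phi(-b).
\]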


For a 95% confidence interval or $\Phi(b) - \Phi(-b) = 0.95$, we have $b = 1.96$, as may be
easily verified. Hence, we can use the approximation

\[
P\left[ -1.96 \le \frac{\hat{p} - p}{\sqrt{p(1-p)/N}} \le 1.96 \right] \approx 0.95
\]

which after the same manipulation as before yields the confidence interval

\[
\hat{p} - 1.96 \sqrt{\frac{p(1-p)}{N}} \le p \le \hat{p} + 1.96 \sqrt{\frac{p(1-p)}{N}}. \tag{15.13}
\]


The only difficulty in applying this result is that we don't know the value of $p$, which
arose from the variance of $\hat{p}$. To circumvent this there are two approaches. We can
replace $p$ by its estimate to yield the confidence interval

\[
\hat{p} - 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{N}} \le p \le \hat{p} + 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{N}}. \tag{15.14}
\]


A more conservative approach is to note that $p(1-p)$ is maximum for $p = 1/2$.
Using this number yields a larger interval than necessary. However, it allows us to
determine before the experiment is performed and the value of $\hat{p}$ revealed, the length
of the confidence interval. This is useful in planning how large $N$ must be in order
to have a confidence interval not exceeding a given length (see Problem 15.25). If
we adopt the latter approach then the confidence interval becomes

\[
\hat{p} \pm 1.96 \sqrt{\frac{p(1-p)}{N}} \, \Bigg|_{p = 1/2} = \hat{p} \pm 1.96 \sqrt{\frac{1/4}{N}} \approx \hat{p} \pm \frac{1}{\sqrt{N}}.
\]

In summary, if we toss a coin with a probability $p$ of heads $N$ times, then the interval
$[\hat{p} - 1/\sqrt{N}, \hat{p} + 1/\sqrt{N}]$ will contain the true value of $p$ more than 95% of the time.
It is said that the error in our estimate of $p$ is $\pm 1/\sqrt{N}$.
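The two ways of removing the unknown $p$ from (15.13) can be compared side by side. This is a minimal Python sketch, assuming the observed count of heads `k` and the number of tosses `n` are given; the function names are hypothetical.

```python
import math


def plug_in_interval(k, n, b=1.96):
    """Interval in the form of (15.14): p_hat +/- b * sqrt(p_hat (1 - p_hat) / N)."""
    p_hat = k / n
    half_width = b * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def conservative_interval(k, n):
    """Worst-case interval p_hat +/- 1/sqrt(N), from p(1-p) <= 1/4 and 1.96/2 rounded up to 1."""
    p_hat = k / n
    half_width = 1.0 / math.sqrt(n)
    return p_hat - half_width, p_hat + half_width


if __name__ == "__main__":
    print(plug_in_interval(750, 1000))
    print(conservative_interval(750, 1000))
```

The conservative interval depends on the data only through its center, so its length $2/\sqrt{N}$ is known before any tosses are made, which is what makes it usable for planning $N$.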

Finally,  returning  to  our  polling  problem  we  ask  N  people  if  they  will  vote  for 
candidate  A.  The  probability  that  a  person  chosen  at  random  will  say  “yes”  is  p, 
because  the  proportion  of  people  in  the  population  who  will  vote  for  candidate  A 
is  p.  We  liken  this  to  tossing  a  single  coin  and  noting  if  it  comes  up  a  head  (vote 
“yes”)  or  a  tail  (vote  “no”).  Then  we  continue  to  record  the  responses  of  N  people 
(continue to toss the coin $N$ times). Assume, for example, 750 people out of 1000 say
“yes”. Then $\hat{p} = 750/1000 = 0.75$ and the margin of error is $\pm 1/\sqrt{N} \approx 3\%$. Hence,
we report the results as 75% of the population would vote for candidate A with a
margin of error of 3%. (Probabilistically speaking, if we continue to poll groups of
1000 voters, estimating $p$ for each group, then about 95 out of 100 groups would
cover the true value of $100p\%$ by their estimated interval $[100\hat{p} - 3, 100\hat{p} + 3]\%$.)
We  needn’t  poll  294,000,000  people  since  we  assume  that  the  percentage  of  the  1000 
people  polled  who  would  vote  for  candidate  A  is  representative  of  the  percentage  of  the 
entire  population.  Is  this  true?  Certainly  not  if  the  1000  people  were  all  relatives  of 
candidate  A.  Pollsters  make  their  living  by  ensuring  that  their  sample  (1000  people 
polled) is a representative cross-section of the entire United States population.

Problems

15.1 (f) For the PMF given by (15.2) plot the CDF for $N = 10$, $N = 30$, and
$N = 100$. What function does the CDF appear to converge to?

15.2 (c) If $X_i \sim \mathcal{N}(1, 1)$ for $i = 1, 2, \ldots, N$ are IID random variables, plot a realization
of the sample mean random variable versus $N$. Should the realization
converge and if so to what value?

15.3 (w,c) Let $X_{1_i} \sim \mathcal{U}(0, 2)$ for $i = 1, 2, \ldots, N$ be IID random variables and let
$X_{2_i} \sim \mathcal{N}(1, 4)$ for $i = 1, 2, \ldots, N$ be another set of IID random variables. If the
sample mean random variable is formed for each set of IID random variables,
which one should converge faster? Implement a computer simulation to check
your results.

15.4 (^) (w) Consider the weighted sum of $N$ IID random variables $Y_N = \sum_{i=1}^{N} a_i X_i$.
If $E_X[X] = 0$ and $\mathrm{var}(X) = 1$, under what conditions will the sum converge
to a number? Can you give an example, other than $a_i = 1/N$, of a set of $a_i$'s
which will result in convergence?

15.5 (w) A random walk is defined as $X_N = X_{N-1} + U_N$ for $N = 2, 3, \ldots$ and
$X_1 = U_1$, where the $U_i$'s are IID random variables with $P[U_i = -1] = P[U_i =
+1] = 1/2$. Will $X_N$ converge to anything as $N \to \infty$?

15.6 (w) To estimate the second moment of a random variable it is proposed to
use $(1/N) \sum_{i=1}^{N} X_i^2$. Under what conditions will the estimate converge to the
true value?

15.7 (^) (w) If $X_i$ for $i = 1, 2, \ldots, N$ are IID random variables, will the random
variable $(1/\sqrt{N}) \sum_{i=1}^{N} X_i$ converge to a number?


15.8 (t,c) In this problem we attempt to demonstrate that convergence in probability
is different than standard convergence of a sequence of real numbers.
Consider the sequence of random variables

\[
Y_N = \frac{X_N}{\sqrt{N}} + u\left( \frac{X_N}{\sqrt{N}} - 0.1 \right)
\]

where the $X_N$'s are IID, each with PDF $X_N \sim \mathcal{N}(0, 1)$ and $u(x)$ is the unit
step function. Prove that $P[|Y_N| > \epsilon] \to 0$ as $N \to \infty$ by using the law of
total probability as

\[
\begin{aligned}
P[|Y_N| > \epsilon] ={}& P\left[ |Y_N| > \epsilon \,\middle|\, X_N/\sqrt{N} > 0.1 \right] P[X_N/\sqrt{N} > 0.1] \\
&+ P\left[ |Y_N| > \epsilon \,\middle|\, X_N/\sqrt{N} \le 0.1 \right] P[X_N/\sqrt{N} \le 0.1].
\end{aligned}
\]

This says that $Y_N \to 0$ in probability. Next simulate this sequence on the
computer for $N = 1, 2, \ldots, 200$ to generate 4 realizations of $\{Y_1, Y_2, \ldots, Y_{200}\}$.
Examine whether for a given $N$ all realizations lie within the “convergence
band” of $[-0.2, 0.2]$. Next generate an additional 6 realizations and overlay all
10 realizations. What can you say about the convergence of any one realization?


15.9 (w) There are 1000 resistors in a bin labeled 10 ohms. Due to manufacturing
tolerances, however, the resistances of the resistors are somewhat different.
Assume that the resistance can be modeled as a random variable with a mean
of 10 ohms and a variance of 2 ohms$^2$. If 100 resistors are chosen from the
bin and connected in series (so the resistances add together), what is the
approximate probability that the total resistance will exceed 1030 ohms?

15.10 (w) Consider a sequence of random variables $X_1, X_1, X_2, X_2, X_3, X_3, \ldots$, where
$X_1, X_2, X_3, \ldots$ are IID random variables. Does the law of large numbers hold?
How about the central limit theorem?


15.11 (w) Consider an Erlang random variable with parameter $N$. If $N$ increases,
does the PDF become Gaussian? Hint: Compare the characteristic functions
of the exponential random variable and the $\Gamma(N, \lambda)$ random variable in Table 11.1.

15.12  (f)  Find  the  approximate  PDF  of  Y  —  Xa=i  A2,  if  the  A^’s  are  IID  with 
A, 4,8). 

15.13  (o)  (f)  Find  the  approximate  PDF  of  Y  —  Xa=i°  Ai,  if  the  A^’s  are  IID 
with  A i  —  U(  1, 3). 

15.14  (f)  Find  the  approximate  probability  that  Y  —  Xa=i  w ill  exceed  7,  if  the 
$X_i$'s are IID with the PDF

\[
p_X(x) = \begin{cases} 2x & 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}
\]

15.15 (c) Modify the computer program clt_demo.m listed in Appendix 15A to
display the repeated convolution of the PDF

\[
p_X(x) = \begin{cases} \dfrac{\pi}{2} \sin(\pi x) & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}
\]

and examine the results.


15.16 (c) Use the computer program clt_demo.m listed in Appendix 15A to display
the repeated convolution of the PDF $\mathcal{U}(0, 1)$. Next modify the program to
display the repeated convolution of the PDF

\[
p_X(x) = \begin{cases} |2 - 4x| & 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}
\]

Which  PDF  results  in  a  faster  convergence  to  a  Gaussian  PDF  and  why? 

15.17 (t) In this problem we prove that the PDF of a standardized $\chi_N^2$ random
variable converges to a Gaussian PDF as $N \to \infty$. To do so let $Y_N \sim \chi_N^2$ and
show that the characteristic function is

\[
\phi_{Y_N}(\omega) = \frac{1}{(1 - 2j\omega)^{N/2}}
\]

by using Table 11.1. Next define the standardized random variable

\[
Z_N = \frac{Y_N - E[Y_N]}{\sqrt{\mathrm{var}(Y_N)}}
\]

and note that the mean and variance of a $\chi_N^2$ random variable are $N$ and $2N$,
respectively. Show the characteristic function of $Z_N$ is

\[
\phi_{Z_N}(\omega) = \frac{\exp\left( -j\omega \sqrt{N/2} \right)}{\left( 1 - j\omega \sqrt{2/N} \right)^{N/2}}.
\]

Finally, take the natural logarithm of $\phi_{Z_N}(\omega)$ and note that for a complex
variable $x$ with $|x| \ll 1$, we have that $\ln(1 - x) \approx -x - x^2/2$. You should be
able to show that as $N \to \infty$, $\ln \phi_{Z_N}(\omega) \to -\omega^2/2$.

15.18 (w) A particle undergoes collisions with other particles. Each collision causes
its horizontal velocity to change according to a $\mathcal{N}(0, 0.1)$ cm/sec random variable.
After 100 independent collisions what is the probability that the particle's
velocity will exceed 5 cm/sec if it is initially at rest? Is this result exact
or approximate?


15.19 (^) (f) The sample mean random variable of $N$ IID random variables with
$X_i \sim \mathcal{U}(0, 1)$ will converge to 1/2. How many random variables need to be
averaged before we can assert that the approximate probability of an error of
not more than 0.01 in magnitude is 0.99?

15.20 (o) (w) An orange grove produces oranges whose weights are uniformly
distributed between 3 and 7 ozs. If a truck can hold 4000 lbs. of oranges, what
is the approximate probability that it can carry 15,000 oranges?

15.21  (w)  A  sleeping  pill  is  effective  for  75%  of  the  population.  If  in  a  hospital  160 
patients  are  given  a  sleeping  pill,  what  is  the  approximate  probability  that  125 
or  more  of  them  will  sleep  better? 

15.22  (^)  (w)  For  which  PDF  will  a  sum  of  IID  random  variables  when  added 
together  have  a  PDF  that  converges  to  a  Gaussian  PDF  the  fastest? 

15.23  (o)  (w)  A  coin  is  tossed  1000  times,  producing  750  heads.  Is  this  a  fair 
coin? 

15.24 (f,c) To compute the probability of (15.11) we can use the following approach
to compute each term in the summation. Each term can be written as

\[
\binom{N}{k} \left( \frac{1}{2} \right)^N = \frac{N(N-1) \cdots (N-k+1)}{1(2)(3) \cdots (k)} \left( \frac{1}{2} \right)^N.
\]

Taking the natural logarithm produces

\[
\ln p_{Y_N}[k] = \sum_{i=N-k+1}^{N} \ln(i) - \sum_{i=1}^{k} \ln(i) - N \ln(2)
\]

which is easily done on a computer. Next, exponentiate to find $p_{Y_N}[k]$ and add
each of the terms together to finally implement the summation. Carry this
out to verify the result given in (15.11). What happens if you try to compute
each term directly?

15.25  (f)  In  a  poll  of  candidate  preferences  for  two  candidates,  we  wish  to  report 
that  the  margin  of  error  is  only  ±1%.  What  is  the  maximum  number  of  people 
whom  we  will  need  to  poll? 

15.26  (^)  (w)  A  clinical  trial  is  performed  to  determine  if  a  particular  drug  is 
effective.  A  group  of  100  people  is  split  into  two  equal  groups  at  random.  The 
drug  is  administered  to  group  1  while  group  2  is  given  a  placebo.  As  a  result 
of  the  study,  40  people  in  group  1  show  a  marked  improvement  while  only 
30  people  in  group  2  do  so.  Is  the  drug  effective?  Hint:  Find  the  confidence 
intervals  (using  (15.14))  for  the  percentage  of  the  people  in  each  group  who 
show  an  improvement. 


Appendix 15A


MATLAB Program to Compute Repeated Convolution of PDFs


% This program demonstrates the central limit theorem. It determines
% the PDF for the sum S_N of N IID random variables. Each marginal PDF
% is assumed to be nonzero over the interval (0,1). The repeated
% convolution integral is implemented using a discrete convolution. The
% plots of the PDF of S_N as N increases are shown successively
% (press carriage return for next plot).
%
% clt_demo.m
clear all
delu=0.005;
u=[0:delu:1-delu]';      % p_X defined on interval [0,1)
p_X=ones(length(u),1);   % try p_X=abs(2-4*u) for really strange PDF
x=[u;u+1];               % increase abscissa values since repeated
                         % convolution increases nonzero width of output
p_S=zeros(length(x),1);
N=12;                    % number of random variables summed
for j=1:length(x)        % start discrete convolution approximation
                         % to continuous convolution
  for i=1:length(u)
    if j-i>0&j-i<=length(p_X)
      p_S(j)=p_S(j)+p_X(i)*p_X(j-i)*delu;
    end
  end
end
plot(x,p_S)              % plot results for N=2
grid
axis([0 N 0 1])          % set axes lengths for plotting
xlabel('x')
ylabel('p_S')
title('PDF for S_N')
text(0.75*N,0.85,'N = 2')   % label plot with the
                            % number of convolutions
for n=3:N
  pause
  x=[x;u+n-1];           % increase abscissa values since
                         % repeated convolution increases
                         % nonzero width of output
  p_S=[p_S;zeros(length(u),1)];
  g=zeros(length(p_S),1);
  for j=1:length(x)      % start discrete convolution
    for i=1:length(u)
      if j-i>0
        g(j,1)=g(j,1)+p_X(i)*p_S(j-i)*delu;
      end
    end
  end
  p_S=g;                 % plot results for N=3,4,...,12
  plot(x,p_S)
  grid
  axis([0 N 0 1])
  xlabel('x')
  ylabel('p_S')
  title('PDF for S_N')
  text(0.75*N,0.85,['N = ' num2str(n)])
end
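For readers without MATLAB, the same idea can be sketched in Python. This is not a line-by-line port of clt_demo.m: it uses a full discrete convolution with zero-based indexing, omits the plotting, and instead reports the area, mean, and variance of the resulting PDF. The function names and the coarser step size of 0.01 are assumptions made for this illustration.

```python
def convolve(a, b, step):
    """Discrete approximation of the convolution integral of two sampled PDFs."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j * step
    return out


def pdf_of_sum(p_x, n_terms, step):
    """Sampled PDF of the sum of n_terms IID random variables, each with sampled PDF p_x."""
    p_s = list(p_x)
    for _ in range(n_terms - 1):
        p_s = convolve(p_x, p_s, step)
    return p_s


def moments(pdf, step):
    """Area, mean, and variance of a PDF sampled at x = 0, step, 2*step, ..."""
    area = sum(pdf) * step
    mean = sum(j * step * v for j, v in enumerate(pdf)) * step / area
    var = sum((j * step - mean) ** 2 * v for j, v in enumerate(pdf)) * step / area
    return area, mean, var


step = 0.01
p_x = [1.0] * 100                  # U(0,1) sampled on [0, 1)
p_s = pdf_of_sum(p_x, 12, step)    # PDF of S_12
area, mean, var = moments(p_s, step)

if __name__ == "__main__":
    print(round(area, 4), round(mean, 3), round(var, 3), round(max(p_s), 4))
```

For $N = 12$ uniform random variables the sum has mean 6 and variance 1, so the sampled PDF should sit close to a $\mathcal{N}(6, 1)$ curve, up to the discretization error of the grid.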


Appendix  15B 


Proof  of  Central  Limit  Theorem 


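In this appendix we prove the central limit theorem for continuous random variables.
Consider the characteristic function of the standardized continuous random variable

\[
Z_N = \frac{S_N - N E_X[X]}{\sqrt{N \mathrm{var}(X)}}
\]

where $S_N = \sum_{i=1}^{N} X_i$ and the $X_i$'s are IID. By definition of $Z_N$ the characteristic
function becomes

\[
\begin{aligned}
\phi_{Z_N}(\omega) &= E_{Z_N}[\exp(j\omega Z_N)] \\
&= E_{\mathbf{X}}\left[ \exp\left( j\omega \frac{\sum_{i=1}^{N} X_i - N E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right) \right] \\
&= E_{\mathbf{X}}\left[ \prod_{i=1}^{N} \exp\left( j\omega \frac{X_i - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right) \right] \\
&= \prod_{i=1}^{N} E_{X_i}\left[ \exp\left( j\omega \frac{X_i - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right) \right] \qquad \text{(independence of $X_i$'s)} \\
&= \left( E_X\left[ \exp\left( j\omega \frac{X - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right) \right] \right)^N \qquad \text{(identically distributed $X_i$'s)}
\end{aligned}
\]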

But for a complex variable $\xi$ we can write its exponential as a Taylor series yielding

\[
\exp(\xi) = \sum_{k=0}^{\infty} \frac{\xi^k}{k!} \qquad \text{(see Problem 5.22).}
\]


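Thus,

\[
\begin{aligned}
E_X\left[ \exp\left( j\omega \frac{X - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right) \right]
&= E_X\left[ \sum_{k=0}^{\infty} \frac{(j\omega)^k}{k!} \left( \frac{X - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right)^k \right] \\
&= \sum_{k=0}^{\infty} \frac{(j\omega)^k}{k!} E_X\left[ \left( \frac{X - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right)^k \right] \qquad \text{(assume interchange valid)} \\
&= 1 + j\omega E_X\left[ \frac{X - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right] + \frac{1}{2} (j\omega)^2 E_X\left[ \left( \frac{X - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right)^2 \right] + E_X[R(X)]
\end{aligned}
\]

where $R(X)$ is the third-order and higher terms of the Taylor expansion. But

\[
E_X\left[ \frac{X - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right] = 0, \qquad
E_X\left[ \left( \frac{X - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right)^2 \right] = \frac{\mathrm{var}(X)}{N \mathrm{var}(X)} = \frac{1}{N}
\]

and so

\[
E_X\left[ \exp\left( j\omega \frac{X - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right) \right] = 1 - \frac{\omega^2}{2N} + E_X[R(X)].
\]

The terms comprising $R(X)$ are

\[
\frac{(j\omega)^3}{3!} \left( \frac{X - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right)^3 + \frac{(j\omega)^4}{4!} \left( \frac{X - E_X[X]}{\sqrt{N \mathrm{var}(X)}} \right)^4 + \cdots
\]

which can be shown to be small, due to the division of the successive terms by
$N^{3/2}, N^2, \ldots$, relative to the $-\omega^2/(2N)$ term. Hence as $N \to \infty$, they do not
contribute to $\phi_{Z_N}(\omega)$ and therefore

\[
\phi_{Z_N}(\omega) \to \left( 1 - \frac{\omega^2}{2N} \right)^N \to \exp\left( -\frac{1}{2} \omega^2 \right) = \phi_Z(\omega) \qquad \text{(see Problem 5.15)}
\]

where $Z \sim \mathcal{N}(0, 1)$. Since the characteristic function of $Z_N$ converges to the characteristic
function of $Z$, we have by the continuity theorem (see Section 11.7) that
the PDF of $Z_N$ must converge to the PDF of $Z$. Therefore, we have finally that as
$N \to \infty$

\[
p_{Z_N}(z) \to p_Z(z) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} z^2 \right).
\]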
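The convergence of the characteristic function can be observed numerically for a specific PDF. The Python sketch below assumes $X \sim \mathcal{U}(0, 1)$, for which $E_X[\exp(ja(X - 1/2))] = \sin(a/2)/(a/2)$ and $\mathrm{var}(X) = 1/12$; the choice of PDF and the function names are illustrative and not part of the proof.

```python
import math


def phi_zn_uniform(omega, n):
    """Characteristic function of the standardized sum of n IID U(0,1) random variables."""
    a = omega / math.sqrt(n / 12.0)          # omega / sqrt(N var(X)), with var(X) = 1/12
    if a == 0.0:
        return 1.0
    single = math.sin(a / 2.0) / (a / 2.0)   # E[exp(j a (X - 1/2))], real by symmetry
    return single ** n


def phi_gaussian(omega):
    """Characteristic function of a N(0,1) random variable."""
    return math.exp(-0.5 * omega * omega)


def error(omega, n):
    """Absolute difference between the finite-N and limiting characteristic functions."""
    return abs(phi_zn_uniform(omega, n) - phi_gaussian(omega))


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, error(1.0, n))
```

At a fixed $\omega$ the error falls as $N$ grows, consistent with the higher-order terms $R(X)$ becoming negligible relative to the $-\omega^2/(2N)$ term.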


Chapter  16 

Basic  Random  Processes 


16.1  Introduction 


So  far  we  have  studied  the  probabilistic  description  of  a  finite  number  of  random 
variables.  This  is  useful  for  random  phenomena  that  have  definite  beginning  and 
end  times.  Many  physical  phenomena,  however,  are  more  appropriately  modeled  as 
ongoing  in  time.  Such  is  the  case  for  the  annual  summer  rainfall  in  Rhode  Island 
as  shown  in  Figure  1.1  and  repeated  for  convenience  in  Figure  16.1.  This  physical 


Year 


Figure  16.1:  Annual  summer  rainfall  in  Rhode  Island  from  1895  to  2002. 

process  has  been  ongoing  for  all  time  and  will  undoubtedly  continue  into  the  future. 
It  is  only  our  limited  ability  to  measure  the  rainfall  over  several  lifetimes  that  has 
produced  the  data  shown  in  Figure  16.1.  It  therefore  seems  more  reasonable  to 
attempt to study the probabilistic characteristics of the annual summer rainfall in
Rhode Island for all time. To do so let $X[n]$ be a random variable that denotes
the annual summer rainfall for year $n$. Then, we will be interested in the behavior
of the infinite tuple of random variables $(\ldots, X[-1], X[0], X[1], \ldots)$, where the
corresponding year for $n = 0$ can be chosen for convenience (maybe according to
the Christian or Hebrew calendars, as examples). Note that we cannot employ our
previous probabilistic methods directly since the number of random variables is not
finite or $N$-dimensional.

Given our interest in the annual summer rainfall, what types of questions are
pertinent? A meteorologist might wish to determine if the rainfall totals are increasing
with time. Hence, he may question if the average rainfall is really constant. If it
is not constant with time, then our estimate of the average, obtained by taking the
sample mean of the values shown in Figure 16.1, is meaningless. As an example,
we would also have obtained an average of 9.76 inches if the rainfall totals were
increasing linearly with time, starting at 7.76 inches and ending at 11.76 inches. The
meteorologist might argue that due to global warming the rainfall totals should be
increasing. We will return to this question in Section 16.8. Another question might
be to assess the probability that the following year the rainfall will be 12 inches or
more if we know the entire past history of rainfall totals. This is the problem of
prediction, which is a fundamental problem in many scientific disciplines.
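The remark about the linear trend is easy to confirm with a few lines of arithmetic. The Python sketch below assumes one total per year from 1895 to 2002 (108 values, matching the span of Figure 16.1) and a perfectly linear ramp; the variable names are illustrative.

```python
years = 2002 - 1895 + 1          # 108 annual totals, assumed one per year
start, stop = 7.76, 11.76        # inches, endpoints of the hypothetical linear trend

# Rainfall totals that increase linearly from start to stop.
ramp = [start + (stop - start) * n / (years - 1) for n in range(years)]
sample_mean = sum(ramp) / years

if __name__ == "__main__":
    print(round(sample_mean, 2))
```

The sample mean of the ramp equals the midpoint of its endpoints, so a constant mean of 9.76 inches and a steadily rising mean cannot be told apart from the sample mean alone.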

A second example of a random process, which is of intense interest, is a man-made
one: the Dow-Jones industrial average (DJIA) for stocks. At the end of each
trading  day  the  average  of  the  prices  of  a  representative  group  of  stocks  is  computed 
to  give  an  indication  of  the  health  of  the  U.S.  stock  market.  Its  usefulness  is  that 
this  value  also  gives  an  indication  of  the  overall  health  of  the  U.S.  economy.  Some 
recent  weekly  values  are  shown  in  Figure  16.2.  The  overall  trend  beginning  at  week 
10  is  upward  until  about  week  60,  at  which  point  it  fluctuates  up  and  down.  Some 
questions  of  interest  are  whether  the  index  will  go  back  up  again  after  week  92 
and  to  what  degree  is  it  possible  to  predict  the  movement  of  the  stock  market,  of 
which the DJIA is an indicator. The financial industry and in fact the health of the
U.S. economy depend to a large degree upon the answers to these questions! In the
remaining  chapters  we  will  describe  the  theory  and  application  of  random  processes. 
As  always,  the  theory  will  serve  as  a  foundation  upon  which  we  will  be  able  to  analyze 
random  processes.  In  any  practical  situation,  however,  the  ideal  theoretical  analysis 
must  be  tempered  with  the  constraints  and  additional  complexities  of  the  real  world. 


Figure 16.2: Dow-Jones industrial average at the end of each week from January 8,
2003 to September 29, 2004 [DowJones.com 2004].


16.2 Summary

A random process is defined in Section 16.3. Four different types of random processes
are described in Section 16.4. They are classified according to whether they
are defined for all time or only for uniformly spaced time samples, and also according
to their possible values as being discrete or continuous. Figure 16.5 illustrates
the various types. A stationary random process is one for which its probabilistic
description does not change with the chosen time origin, which is expressed mathematically
by (16.3). An IID random process is stationary as shown in Example 16.3.
The concept of a random process having stationary and independent increments is
described in Section 16.5 with an illustration given in Example 16.5. Some more
examples of random processes are given in Section 16.6. The most useful moments
of a random process, the mean sequence and the covariance sequence, are defined
by (16.5) and (16.7), respectively. Finally, in Section 16.8 an application of the
estimation of the mean sequence to predicting average rainfall totals is described.
The least squares estimator of the slope and intercept of a straight line is found
using (16.9) and is commonly used in data analysis problems.

16.3  What  Is  a  Random  Process? 

To  define  the  concept  of  a  random  process  we  will  begin  by  considering  our  usual 
example  of  a  coin  tossing  experiment.  Assume  that  at  some  start  time  we  toss 
a  coin  and  then  repeat  this  subexperiment  at  one  second  intervals  for  all  time. 
Letting  n  denote  the  time  in  seconds,  we  therefore  generate  successive  outcomes 
at  times  n  =  0,1,...  .  The  experiment  continues  indefinitely.  Since  there  are 
two  possible  outcomes  for  each  coin  toss  and  we  will  assume  that  the  tosses  are 
independent,  we  have  an  infinite  sequence  of  Bernoulli  trials.  This  is  termed  a 
Bernoulli  random  process  and  extends  the  finite  Bernoulli  set  of  random  variables 
first  introduced  in  Section  4.6.2,  in  which  a  finite  number  of  trials  were  carried 
out. As usual, we let the probability of a head (X = 1) be p and the probability
of a tail (X = 0) be 1 - p for each trial. With this setup, a random
process can be defined as a mapping from the original experimental sample space
$\mathcal{S} = \{(H,H,T,\ldots), (H,T,H,\ldots), (T,T,H,\ldots), \ldots\}$ to the numerical sample space
$\mathcal{S}_X = \{(1,1,0,\ldots), (1,0,1,\ldots), (0,0,1,\ldots), \ldots\}$. Note that each simple event or
element of $\mathcal{S}$ is an infinite sequence of H’s and T’s which is then mapped into an
infinite sequence of 1’s and 0’s, which is the corresponding simple event in $\mathcal{S}_X$. One
may picture a random process as being generated by the “random process generator”
shown in Figure 16.3. The random process is composed of the infinite (but


Figure 16.3: A conceptual random process generator. The input is an infinite sequence
of random variables (X[0], X[1], ...) with their probabilistic (PMF) description and the
output is an infinite sequence of numbers (x[0], x[1], ...).

countable) “vector” of random variables (X[0], X[1], ...), each of which is a Bernoulli
random variable, and each outcome of the random process is given by the infinite
sequence of numerical values (x[0], x[1], ...). As usual, uppercase letters are used for
the random variables and lowercase letters for the values they take on. Some typical
outcomes of the Bernoulli random process are shown in Figure 16.4. They were
generated in MATLAB using x=floor(rand(31,1)+0.5) for each outcome. Each
sequence in Figure 16.4 is called an outcome or by its synonyms of realization or
sample sequence. We will prefer the use of the term “realization”. Each realization
is an infinite sequence of numbers. Hence, the random process is a mapping from $\mathcal{S}$,
which is a set of infinite sequential experimental outcomes, to $\mathcal{S}_X$, which is a set of
infinite sequences of 1’s and 0’s or realizations. The total number of realizations is
not countable (see Problem 16.3). The set of all realizations is sometimes referred
to as the ensemble of realizations. Just as for the case of a single random variable,
which is a mapping from $\mathcal{S}$ to $\mathcal{S}_X$ and therefore is represented as the set function
X(s), a similar notation is used for random processes. Now, however, we will use
X[n, s] to represent the mapping from an element of $\mathcal{S}$ to a realization x[n]. In
Figure 16.4 we see the result of the mapping for $s = s_1$, which is $X[n, s_1] = x_1[n]$,
as well as others. It is important to note that if we fix n at n = 18, for example,
then X[18, s] is a random variable that has a Bernoulli PMF. Three of its outcomes
are shown highlighted in Figure 16.4 with dashed boxes. Hence, all the methods
developed for a single random variable are applicable. Likewise, if we fix two samples
at n = 20 and n = 22, then X[20, s] and X[22, s] together form a bivariate random
vector. Again all our previous methods for two-dimensional random vectors apply.

Figure 16.4: Typical outcomes of Bernoulli random process with p = 0.5. The
realization starts at n = 0 and continues indefinitely. The dashed box indicates the
realizations of the random variable X[18, s].

To summarize, a random process is defined to be an infinite sequence of random
variables (X[0], X[1], ...), with one random variable for each time instant, and
each realization of the random process takes on a value that is represented as an
infinite sequence of numbers or (x[0], x[1], ...). We will denote the random process
more succinctly by X[n] and the realization by x[n] but it is understood that the n
denotes the values n = 0, 1, .... If we wish to indicate the random process at a fixed
time instant, then we will use $n = n_0$ or $n = n_1$, etc. so that $X[n_0]$ is the random
process at $n = n_0$ (which is just a random variable) and its realization at that time
is $x[n_0]$ (which is a number). Finally, we have used the [·] notation to remind us
that X[n] is defined only for discrete integer times. This type of random process is
known as a discrete-time random process. In the next section the continuous-time
random process will be discussed. Before continuing, however, we look at a typical
probability calculation for a random process.

Example  16.1  -  Bernoulli  random  process 

For the infinite coin tossing example, we might ask for the probability of the first
5 tosses coming up all heads. Thus, we wish to evaluate

$$P[X[0] = 1, X[1] = 1, X[2] = 1, X[3] = 1, X[4] = 1, X[5] = 0 \text{ or } 1, X[6] = 0 \text{ or } 1, \ldots].$$

It would seem that since we don’t care what the outcomes of X[n] for n = 5, 6, ...
are, then the probability expression could be replaced by

$$P[X[0] = 1, X[1] = 1, X[2] = 1, X[3] = 1, X[4] = 1]$$

and indeed this is the case, although it is not so easy to prove [Billingsley 1986].
Then, by using the assumption of independence of a Bernoulli random process we
have

$$P[X[0] = 1, X[1] = 1, X[2] = 1, X[3] = 1, X[4] = 1] = \prod_{n=0}^{4} P[X[n] = 1] = p^5.$$

A  related  question  is  to  determine  the  probability  that  we  will  ever  observe  5  ones 
in  a  row.  Intuitively,  we  expect  this  probability  to  be  1,  but  how  do  we  prove  this? 
It  is  not  easy!  Such  is  the  difficulty  encountered  when  we  make  the  leap  from  a 
random  vector,  having  a  finite  number  of  random  variables,  to  a  random  process, 
having  an  infinite  number  of  random  variables. 
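
The finite-dimensional result $p^5$ can be checked by simulation. The MATLAB sketch below generates many realizations of the first five samples of a Bernoulli random process and uses the relative frequency of the event as an estimate. The value p = 0.5, the number of realizations M, and the variable names are illustrative assumptions of ours, not values taken from the example.

```matlab
% Monte Carlo check of P[X[0]=1,...,X[4]=1] = p^5 for a Bernoulli random process.
p = 0.5;                        % probability of a head (assumed value)
M = 100000;                     % number of realizations (illustrative choice)
N = 5;                          % only the first 5 samples matter for this event
x = double(rand(M,N) < p);      % each row holds x[0],...,x[4] of one realization
allHeads = all(x == 1, 2);      % true for the rows that contain five ones
Pest = mean(double(allHeads));  % relative frequency estimate of the probability
Ptrue = p^5;                    % value obtained from independence
fprintf('estimated %.4f, theoretical %.4f\n', Pest, Ptrue);
```

The estimate should approach $p^5$ as M grows. The simulation says nothing about the harder question of ever observing 5 ones in a row, which involves the entire infinite sequence.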

❖ 


16.4  Types  of  Random  Processes 

The previous example of an infinite number of coin tosses produced a random process
X[n] for n = 0, 1, .... In some cases, however, we wish to think of the random
process as having started sometime in the infinite past. If X[n] is defined for
n = ..., -1, 0, 1, ... or equivalently $-\infty < n < \infty$, where it is assumed that n is an
integer, then X[n] is called an infinite random process. In contrast, the previous
example is referred to as a semi-infinite random process. Another categorization
of random processes involves whether the times at which the random variables are
defined and the values that they take on are either discrete or continuous. The
infinite coin toss example is a discrete-time random process, since it is defined for
n = 0, 1, ..., and is a discrete-valued random process, since it takes on values 0 and 1 only.
It is referred to as a discrete-time/discrete-valued (DTDV) random process. Other
types of random processes are discrete-time/continuous-valued (DTCV),
continuous-time/discrete-valued (CTDV), and continuous-time/continuous-valued (CTCV). A
realization of each type is shown in Figure 16.5. In Figure 16.5a a realization of
the Bernoulli random process, as previously described, is shown while in Figure
16.5b a realization of a Gaussian random process with $Y[n] \sim \mathcal{N}(0, 1)$ is shown.
The Bernoulli random process is defined for n = 0, 1, ... (semi-infinite) while the
Gaussian random process is defined for $-\infty < n < \infty$ and n an integer (infinite).

Figure 16.5: Typical realizations of different types of random processes.
(a) Discrete-time/discrete-valued (DTDV) Bernoulli random process.
(b) Discrete-time/continuous-valued (DTCV) Gaussian random process.
(c) Continuous-time/discrete-valued (CTDV) binomial random process.
(d) Continuous-time/continuous-valued (CTCV) Gaussian random process.
The horizontal axis of the continuous-time panels is time, t (sec).


Both these random processes are discrete-time with the first one taking on only the
values 0 and 1 and the second one taking on all real values. In Figure 16.5c is shown
a random process, also known as a continuous-time binomial random process, which
is defined as $W(t) = \sum_{n=0}^{[t]} X[n]$, where X[n] is a Bernoulli random process and [t]
denotes the largest integer less than or equal to t. This process effectively counts
the number of successes or ones of the Bernoulli random process (compare Figure
16.5c with Figure 16.5a). It is defined for all time; hence, it is a continuous-time
random process, and it takes on only integer values in the range {0, 1, ...}; hence,
it is discrete-valued. Finally, in Figure 16.5d is shown a realization of another
Gaussian process but with $Z(t) \sim \mathcal{N}(0, 1)$ for all time t. This is a continuous-time
random process that takes on all real values; hence, it is continuous-valued. We
will generally use a discrete-time random process, with either discrete or continuous
values, to introduce new concepts. This is because a continuous-time random process
introduces a host of mathematical subtleties which in many cases are beyond the
scope of this text. When possible, however, we will quote the analogous results for
continuous-time random processes. Note finally that a realization of X[n], which is
x[n], is also called a sample sequence, while a realization of X(t), which is x(t), is
also called a sample function. We will, however, reserve the use of the word sample
to refer to a time sample of the random process. Hence, a time sample will refer
to either the random variable $X[n_0]$ ($X(t_0)$) or the realization $x[n_0]$ ($x(t_0)$) of the
random process, with the meaning determined by the context of the discussion. We
next revisit the random walk of Example 9.5.


Example  16.2  -  Random  walk  (continued  from  Example  9.5) 

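Recall that

$$X_n = \sum_{i=1}^{n} U_i \qquad n = 1, 2, \ldots$$

where

$$p_U[k] = \begin{cases} \frac{1}{2} & k = -1 \\ \frac{1}{2} & k = 1 \end{cases} \qquad (16.1)$$

and the $U_i$’s are IID. The random walk is a random process so that rewriting the
definition in our new notation, we have

$$X[n] = \sum_{i=0}^{n} U[i] \qquad n = 0, 1, \ldots$$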


where the U[i]’s are IID random variables having the PMF of (16.1). We also
assume that the random walk starts at time n = 0. The U[i]’s comprise the random
variables of a Bernoulli random process but with values of ±1, instead of the usual
0 and 1. As such, we can view the U[i]’s as comprising a Bernoulli random process
U[n] for n = 0, 1, .... Realizations of U[n] and X[n] are shown in Figure 16.6. One
question that comes to mind is the behavior of the random walk for large n. For
example, we might be interested in the PDF of X[n] for large n. Relying on the
central limit theorem (see Chapter 15), we can assert that the PDF is Gaussian,
and therefore we need only determine the mean and variance. This easily follows
from the definition of the random walk as

$$E[X[n]] = \sum_{i=0}^{n} E[U[i]] = (n+1)E[U[0]] = 0$$

$$\mathrm{var}(X[n]) = \sum_{i=0}^{n} \mathrm{var}(U[i]) = (n+1)\,\mathrm{var}(U[0]) = n+1$$

since E[U[i]] = 0 and var(U[i]) = 1. (Note that since the U[i]’s are identically
distributed, they all have the same mean and variance. We have arbitrarily chosen
U[0] in the expression for the mean and variance of a single sample.) Hence, for
large n we have approximately that $X[n] \sim \mathcal{N}(0, n+1)$. Does this appear to explain
the behavior of x[n] shown in Figure 16.6b?

Figure 16.6: Typical realization of a random walk. (a) Realization of Bernoulli random
process U[n]. (b) Realization of random walk X[n].
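
One way to explore this question is to generate many realizations of the random walk and estimate the mean and variance across the ensemble at each time n. The MATLAB sketch below does this. The number of realizations M, the record length N, and the variable names are illustrative assumptions of ours.

```matlab
% Random walk X[n] = U[0] + ... + U[n] with U[i] = +1 or -1, each with probability 1/2.
M = 20000;                          % number of realizations (illustrative choice)
N = 31;                             % samples n = 0,1,...,30 per realization
u = 2*double(rand(M,N) < 0.5) - 1;  % IID steps of +/-1; each row is one realization
x = cumsum(u, 2);                   % column k holds X[k-1] for every realization
n = 0:N-1;
meanEst = mean(x, 1);               % ensemble average at each n, should be near 0
varEst = var(x, 0, 1);              % ensemble variance at each n, should be near n+1
k = [1 11 21 31];                   % display a few time instants only
disp([n(k)' meanEst(k)' varEst(k)' (n(k)+1)']);
```

The columns displayed are n, the estimated mean, the estimated variance, and the predicted variance n + 1. The growth of the variance with n is why a single realization such as that of Figure 16.6b can wander far from zero.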

❖ 


16.5  The  Important  Property  of  Stationarity 

The simplest type of random process is an IID random process. The Bernoulli
random process is an example of this. Each random variable $X[n_0]$ is independent
of all the others and each random variable has the same marginal PMF. As such,
the joint PMF of any finite number of samples can immediately be written as

$$p_{X[n_1],X[n_2],\ldots,X[n_N]}[x_1, x_2, \ldots, x_N] = \prod_{i=1}^{N} p_{X[n_i]}[x_i] \qquad (16.2)$$

and used for probability calculations. For example, for a Bernoulli random process
with values 0, 1 the probability of the first 10 samples being 1, 0, 1, 0, 1, 0, 1, 0, 1, 0 is
$p^5(1-p)^5$. Note that we are able to specify the joint PMF for any finite number
of sample times. This is sometimes referred to as being able to specify the finite
dimensional distribution (FDD). It is the most complete probabilistic description
that we can manage for a random process and reduces the analysis of a random
process to the analysis of a finite but arbitrary set of random variables.

A generalization of the IID random process is a random process for which the
FDD does not change with the time origin. This is to say that the PMF or PDF
of the samples $\{X[n_1], X[n_2], \ldots, X[n_N]\}$ is the same as for
$\{X[n_1+n_0], X[n_2+n_0], \ldots, X[n_N+n_0]\}$, where $n_0$ is an arbitrary integer. Alternatively, the set of
samples can be shifted in time, with each one being shifted the same amount, without
affecting the joint PMF or joint PDF. Mathematically, for the FDD not to change
with the time origin, we must have that

$$p_{X[n_1+n_0],X[n_2+n_0],\ldots,X[n_N+n_0]} = p_{X[n_1],X[n_2],\ldots,X[n_N]} \qquad (16.3)$$

for all $n_0$, and for any arbitrary choice of N and $n_1, n_2, \ldots, n_N$. Such a random
process is said to be stationary. It is implicit from (16.3) that all joint and marginal
PMFs or PDFs must have probabilities that do not depend on the time origin. For
example, by letting N = 1 in (16.3) we have that $p_{X[n_1+n_0]} = p_{X[n_1]}$ and setting
$n_1 = 0$, we have that $p_{X[n_0]} = p_{X[0]}$ for all $n_0$. This says that the marginal PMF or
PDF is the same for every sample in a stationary random process. We next prove
that an IID random process is stationary.
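
For an IID Bernoulli random process the factorization (16.2) can be evaluated directly. The MATLAB sketch below computes the joint PMF for the pattern 1, 0, 1, 0, 1, 0, 1, 0, 1, 0 as a product of marginals. The value p = 0.3 and the variable names are illustrative assumptions of ours.

```matlab
% Joint PMF (16.2) of an IID Bernoulli process evaluated for one pattern of values.
p = 0.3;                                       % probability of a 1 (assumed value)
pattern = [1 0 1 0 1 0 1 0 1 0];               % the values x_1,...,x_10
marginals = p.^pattern .* (1-p).^(1-pattern);  % Bernoulli PMF at each value
Pjoint = prod(marginals);                      % product of marginals by independence
Pcheck = p^5*(1-p)^5;                          % closed form for this pattern
fprintf('joint PMF %.6g, p^5(1-p)^5 = %.6g\n', Pjoint, Pcheck);
```

Note that the sample times never enter the computation, only the values do. Shifting all the sample times by the same $n_0$ therefore leaves the result unchanged, which is the content of (16.3) for this process.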

Example  16.3  -  IID  random  process  is  stationary. 

To prove that the IID random process is a special case of a stationary random
process we must show that (16.3) is satisfied. This follows from

$$p_{X[n_1+n_0],X[n_2+n_0],\ldots,X[n_N+n_0]} = \prod_{i=1}^{N} p_{X[n_i+n_0]} \qquad \text{(by independence)}$$

$$= \prod_{i=1}^{N} p_{X[n_i]} \qquad \text{(by identically distributed)}$$

$$= p_{X[n_1],X[n_2],\ldots,X[n_N]} \qquad \text{(by independence)}.$$

❖

If a random process is stationary, then all its joint moments and more generally all
expected values of functions of the random process, must also be stationary since

$$E_{X[n_1+n_0],\ldots,X[n_N+n_0]}[\cdot] = E_{X[n_1],\ldots,X[n_N]}[\cdot]$$

which follows from (16.3). Examples then of random processes that are not stationary
are ones whose means and/or variances change in time, which implies that the
marginal PMF or PDF change with time. In Figure 16.7 we show typical realizations
of random processes whose mean in Figure 16.7a and whose variance in Figure
16.7b change with time. They were generated using the MATLAB code:

randn('state',0)
N=51;
x=randn(N,1)+0.1*[0:N-1]';          % for Figure 16.7a
y=sqrt(0.95.^[0:50]').*randn(N,1);  % for Figure 16.7b

Figure 16.7: Random processes that are not stationary. (a) Mean increasing with n.
(b) Variance decreasing with n.

In Figure 16.7a the true mean increases linearly from 0 to 5 while in Figure 16.7b the
variance decreases exponentially as $0.95^n$. It is clear then that the samples all have
different moments and therefore $p_{X[n_1+n_0]} \neq p_{X[n_1]}$ which violates the condition for
stationarity.


It is impossible to determine if a random process is stationary from a single realization.


A  realization  of  a  random  process  is  a  single  outcome  of  the  random  process.  This  is 
analogous  to  observing  a  single  outcome  of  a  coin  toss.  We  cannot  determine  if  the 
coin  is  fair  by  observing  that  the  outcome  was  a  head.  What  is  required  are  multiple 
realizations  of  the  coin  tossing  experiment.  So  it  is  with  random  processes.  In  Figure 
16.7b,  although  we  generated  the  realization  using  a  variance  that  decreased  with 
time,  and  hence  the  random  process  is  not  stationary,  the  realization  shown  could 
have  been  generated  with  a  constant  variance.  Then,  the  values  of  the  realization 
near  n  =  50  just  happen  to  be  smaller  than  the  ones  near  n  =  0,  which  is  possible, 
although  maybe  not  very  probable.  To  better  discern  whether  a  random  process  is 
stationary  we  require  multiple  realizations. 
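
With multiple realizations available, the variance at each time instant can be estimated by averaging across the ensemble rather than along time. The MATLAB sketch below does this for the process used for Figure 16.7b. The number of realizations M and the variable names are illustrative assumptions of ours.

```matlab
% Ensemble (across-realization) variance estimates for the process of Figure 16.7b,
% Y[n] = sqrt(0.95^n)*W[n] with W[n] ~ N(0,1) and IID.
M = 5000;                             % number of realizations (illustrative choice)
N = 51;                               % samples n = 0,1,...,50
n = 0:N-1;
scale = sqrt(0.95.^n);                % standard deviation at each n (1 x N)
y = repmat(scale, M, 1).*randn(M,N);  % each row is one realization
varEst = mean(y.^2, 1);               % estimate of var(Y[n]); the mean is zero
varTrue = 0.95.^n;                    % true variance sequence
disp([varEst(1) varTrue(1); varEst(end) varTrue(end)]);
```

Each row of the displayed matrix compares the estimated and true variance, first at n = 0 and then at n = 50. A clear change of the ensemble estimate with n is evidence against stationarity that no single realization can provide.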


Another  example  of  a  random  process  that  is  not  stationary  follows. 

Example 16.4 - Sum random process

A sum random process is a slight generalization of the random walk process of
Example 16.2. As before, $X[n] = \sum_{i=0}^{n} U[i]$, where the U[i]’s are IID but for the
general sum process, the U[i]’s can have any, although the same, PMF or PDF.
Thus, the sum random process is not stationary since

$$E[X[n]] = (n+1)E_U[U[0]]$$
$$\mathrm{var}(X[n]) = (n+1)\,\mathrm{var}(U[0])$$

both of which change with n. Hence, it violates the condition for stationarity.

❖

A random process that is not stationary is said to be nonstationary. In light of
the fact that an IID random process lends itself to simple probability calculations,
it is advantageous, if possible, to transform a nonstationary random process into a
stationary one (see Problem 16.12 on transforming the random processes of Figure
16.7 into stationary ones). As an example, for the sum random process this can be
done by “reversing” the summing operation. Specifically, we difference the random
process. Then X[n] - X[n-1] = U[n] for $n \geq 0$, where we define X[-1] = 0.
This is an IID random process. The differences or increment random variables U[n]
are independent and identically distributed. More generally, for the sum random
process any two increments of the form

$$X[n_2] - X[n_1] = \sum_{i=n_1+1}^{n_2} U[i]$$

$$X[n_4] - X[n_3] = \sum_{i=n_3+1}^{n_4} U[i]$$

are independent if $n_4 > n_3 \geq n_2 > n_1$. Thus, nonoverlapping increments for a sum
random process are independent. (Recall that functions of independent random
variables are themselves independent.) If furthermore, $n_4 - n_3 = n_2 - n_1$, then they
also have the same PMF or PDF since they are composed of the same number of IID
random variables. It is then said that for the sum random process, the increments
are independent and stationary (equivalent to being identically distributed) or that
it has stationary independent increments. The reader may wish to ponder whether
a random process can have independent but nonstationary increments (see Problem
16.13). Many random processes (an example of which follows) that we will encounter
have this property and it allows us to more easily analyze the probabilistic behavior.
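
The differencing operation described above is easily tried numerically. The MATLAB sketch below forms a sum random process from IID increments and then differences it, with X[-1] taken as zero, to recover the increments. The Gaussian choice for U[n], the record length N, and the variable names are illustrative assumptions of ours.

```matlab
% Differencing a sum random process recovers its IID increments.
N = 20;                      % number of samples n = 0,1,...,N-1 (illustrative)
u = randn(N,1);              % IID increments U[n]; Gaussian is an assumed choice
x = cumsum(u);               % sum process X[n] = U[0] + ... + U[n]
d = x - [0; x(1:N-1)];       % X[n] - X[n-1] with X[-1] defined as 0
maxErr = max(abs(d - u));    % zero apart from rounding error
disp(maxErr);
```

Any PMF or PDF could be used for the increments, since the recovery of U[n] depends only on the summing structure and not on the distribution.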

Example  16.5  -  Binomial  counting  random  process 

Consider the repeated coin tossing experiment where we are interested in the number
of heads that occurs. Letting U[n] be a Bernoulli random process with U[n] = 1
with probability p and U[n] = 0 with probability 1 - p, the number of heads is given
by the binomial counting or sum process

$$X[n] = \sum_{i=0}^{n} U[i] \qquad n = 0, 1, \ldots$$

or equivalently

$$X[n] = \begin{cases} U[0] & n = 0 \\ X[n-1] + U[n] & n \geq 1. \end{cases}$$

A typical realization is shown in Figure 16.8. The random process has stationary
and independent increments since the changes over two nonoverlapping intervals
are composed of different sets of identically distributed U[i]’s. We can use this
property to more easily determine probabilities of events. For example, to determine
$p_{X[1],X[2]}[1, 2] = P[X[1] = 1, X[2] = 2]$, we can note that the event X[1] = 1, X[2] = 2
is equivalent to the event $Y_1 = X[1] - X[-1] = 1$, $Y_2 = X[2] - X[1] = 1$, where
X[-1] is defined to be identically zero. But $Y_1$ and $Y_2$ are nonoverlapping increments
(but of unequal length), making them independent random variables. Thus,

$$P[X[1] = 1, X[2] = 2] = P[Y_1 = 1, Y_2 = 1] = P[Y_1 = 1]P[Y_2 = 1]$$

$$= P[\underbrace{U[0] + U[1]}_{\mathrm{bin}(2,p)} = 1]\,P[U[2] = 1]$$

$$= \binom{2}{1} p^1 (1-p)^1 \cdot p$$

$$= 2p^2(1-p).$$

Figure 16.8: Typical realization of binomial counting random process with p = 0.5.
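
The result $2p^2(1-p)$ can be confirmed by brute force, since only the three Bernoulli random variables U[0], U[1], U[2] are involved. The MATLAB sketch below enumerates their eight possible outcomes and adds the probabilities of those for which X[1] = 1 and X[2] = 2. The value p = 0.3 and the variable names are illustrative assumptions of ours.

```matlab
% Enumerate all outcomes of (U[0],U[1],U[2]) to check P[X[1]=1, X[2]=2] = 2p^2(1-p).
p = 0.3;                        % probability of a head (assumed value)
Ptotal = 0;
for u0 = 0:1
    for u1 = 0:1
        for u2 = 0:1
            u = [u0 u1 u2];
            x = cumsum(u);      % x(1) = X[0], x(2) = X[1], x(3) = X[2]
            if x(2) == 1 && x(3) == 2
                % probability of this outcome, by independence of the U[i]'s
                Ptotal = Ptotal + prod(p.^u .* (1-p).^(1-u));
            end
        end
    end
end
Pformula = 2*p^2*(1-p);
fprintf('enumeration %.4f, formula %.4f\n', Ptotal, Pformula);
```

Only the outcomes (1, 0, 1) and (0, 1, 1) contribute, each with probability $p^2(1-p)$, in agreement with the calculation using independent increments.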
16.6 Some More Examples

We  continue  our  discussion  by  examining  some  random  processes  of  practical  interest. 

Example 16.6 - White Gaussian noise

A common model for physical noise, such as resistor noise due to electron motion
fluctuations in an electric field, is termed white Gaussian noise (WGN). It is assumed
that the noise has been sampled in time to yield a DTCV random process X[n]. The
WGN random process is defined to be an IID one whose marginal PDF is Gaussian
so that $X[n] \sim \mathcal{N}(0, \sigma^2)$ for $-\infty < n < \infty$. Each random variable $X[n_0]$ has a
mean of zero, consistent with our notion of a noise process, and the same variance
or because the mean is zero, the same power $E[X^2[n_0]]$. A typical realization is
shown in Figure 16.5b for $\sigma^2 = 1$. The WGN random process is stationary since it
is an IID random process. Its joint PDF is

$$p_{X[n_1],X[n_2],\ldots,X[n_N]}(x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} p_{X[n_i]}(x_i)$$

$$= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2} x_i^2\right)$$

$$= \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{N} x_i^2\right). \qquad (16.4)$$

Note that the joint PDF is $\mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$, which is a special form of the multivariate
Gaussian PDF (see Problem 16.15). The terminology of “white” derives from the
property that such a random process may be synthesized from a sum of different
frequency random sinusoids each having the same power, much the same as white
light is composed of equal contributions of each visible wavelength of light. We will
justify this property in Chapter 17 when we discuss the power spectral density.

❖ 


Example 16.7 - Moving average random process

The moving average (MA) random process is a DTCV random process defined as

$$X[n] = \frac{1}{2}(U[n] + U[n-1]) \qquad -\infty < n < \infty$$

where U[n] is a WGN random process with variance $\sigma_U^2$. (To avoid confusion with
the variance of other random variables we will sometimes use a subscript on $\sigma^2$, in
this case $\sigma_U^2$, to refer to the variance of the $U[n_0]$ random variable.) The terminology
of moving average refers to the averaging of the current random variable U[n] with
the previous random variable U[n-1] to form the current moving average random
variable. Also, this averaging “moves” in time, as for example,

$$X[0] = \frac{1}{2}(U[0] + U[-1])$$
$$X[1] = \frac{1}{2}(U[1] + U[0])$$
$$X[2] = \frac{1}{2}(U[2] + U[1])$$

etc.

A typical realization of X[n] is shown in Figure 16.9 and should be compared to
the realization of U[n] shown in Figure 16.5b. It is seen that the moving average
random process is “smoother” than the WGN random process, from which it was
obtained. Further smoothing is possible by averaging more WGN samples together
(see Problem 16.17). The MATLAB code shown below was used to generate the
realization.

randn('state',0)
u=randn(21,1);
for i=1:21
   if i==1
      x(i,1)=0.5*(u(1)+randn(1,1));  % needed to initialize sequence
   else
      x(i,1)=0.5*(u(i)+u(i-1));
   end
end


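Figure 16.9: Typical realization of moving average random process. The realization
of the U[n] random process is shown in Figure 16.5b.

The joint PDF of X[n] can be determined by observing that it is a linearly transformed
version of U[n]. As an example, to determine the joint PDF of the random
vector $[X[0]\ X[1]]^T$, we have from the definition of the MA random process

$$\underbrace{\begin{bmatrix} X[0] \\ X[1] \end{bmatrix}}_{\mathbf{X}} = \underbrace{\begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}}_{\mathbf{G}} \underbrace{\begin{bmatrix} U[-1] \\ U[0] \\ U[1] \end{bmatrix}}_{\mathbf{U}}$$

or in matrix/vector notation $\mathbf{X} = \mathbf{G}\mathbf{U}$. Now recalling that $\mathbf{U}$ is a Gaussian random
vector (see (16.4)) and that a linear transformation of a Gaussian random vector
produces another Gaussian random vector, we have from Example 14.3 that

$$\mathbf{X} \sim \mathcal{N}(\mathbf{G}E[\mathbf{U}], \mathbf{G}\mathbf{C}_U\mathbf{G}^T).$$

Explicitly, since each sample of U[n] is zero mean with variance $\sigma_U^2$ and all samples
are independent, we have that $E[\mathbf{U}] = \mathbf{0}$ and $\mathbf{C}_U = \sigma_U^2 \mathbf{I}$. This results in

$$\begin{bmatrix} X[0] \\ X[1] \end{bmatrix} \sim \mathcal{N}(\mathbf{0}, \sigma_U^2 \mathbf{G}\mathbf{G}^T)$$

where

$$\mathbf{G}\mathbf{G}^T = \begin{bmatrix} \frac{1}{2} & \frac{1}{4} \\ \frac{1}{4} & \frac{1}{2} \end{bmatrix}.$$

It can furthermore be shown that the MA random process is stationary (see Example
20.2 and Property 20.2).

❖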

Example  16.8  -  Randomly  phased  sinusoid  (or  sine  wave) 

Consider the DTCV random process given as

$$X[n] = \cos(2\pi(0.1)n + \Theta) \qquad -\infty < n < \infty$$

where $\Theta \sim \mathcal{U}(0, 2\pi)$. Some typical realizations are shown in Figure 16.10. The MATLAB
statements n=[0:31]' and x=cos(2*pi*0.1*n+2*pi*rand(1,1)) can be used
to generate each realization. This random process is frequently used to model an
analog sinusoid whose phase is unknown and that has been sampled by an
analog-to-digital convertor. It is nearly a deterministic signal, except for the phase uncertainty,
and is therefore perfectly predictable. This is to say that once we observe two successive
samples, then all the remaining ones are known (see Problem 16.20). This is
in contrast to the WGN random process, for which regardless of how many samples
we observe, we cannot predict any of the remaining ones due to the independence
of the samples. Because of the predictability of the randomly phased sinusoidal
process, the joint PDF can only be represented using impulsive functions. As an example,
you might try to find the PDF of (X, Y) if (X, Y) has the bivariate Gaussian


Figure 16.10: Typical realizations for randomly phased sinusoid. (a) θ = 5.9698.
(b) θ = 1.4523. (c) θ = 3.8129.


PDF with $\rho = 1$. We will not pursue this further. However, we can determine the
marginal PDF $p_{X[n]}$. To do so we use the transformation formula of (10.30), where
the Y random variable is $X[n_0]$ (considering the random process at a fixed time)
and the X random variable is Θ. The transformation is shown in Figure 16.11 for
$n_0 = 0$. Note that there are two solutions for any given $x[n_0] = y$ (except for the
point at θ = π, which has probability zero). We denote the solutions as $\theta = x_1, x_2$.
Using our previous notation of y = g(x) for a transformation of a single random
variable we have that

$$y = \cos(2\pi(0.1)n_0 + x)$$

so that the solutions are

$$x_1 = \arccos(y) - 2\pi(0.1)n_0 = g_1^{-1}(y)$$
$$x_2 = 2\pi - [\arccos(y) - 2\pi(0.1)n_0] = g_2^{-1}(y)$$

for $-1 < y < 1$ and thus $0 < \arccos(y) < \pi$. Using $d\arccos(y)/dy = -1/\sqrt{1-y^2}$, we
have

$$p_Y(y) = p_X(g_1^{-1}(y)) \left| \frac{dg_1^{-1}(y)}{dy} \right| + p_X(g_2^{-1}(y)) \left| \frac{dg_2^{-1}(y)}{dy} \right|$$

$$= \frac{1}{2\pi} \left| -\frac{1}{\sqrt{1-y^2}} \right| + \frac{1}{2\pi} \left| \frac{1}{\sqrt{1-y^2}} \right|$$

$$= \frac{1}{\pi\sqrt{1-y^2}}.$$

Figure 16.11: Function transforming Θ into $X[n_0]$ for the value $n_0 = 0$, where
$X[n_0] = \cos(2\pi(0.1)n_0 + \Theta)$.


Finally, in our original notation we have the marginal PDF for X[n] for any n

$$p_{X[n]}(x) = \begin{cases} \dfrac{1}{\pi\sqrt{1-x^2}} & -1 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

This PDF is shown in Figure 16.12. Note that the values of X[n] that are most
probable are near $x = \pm 1$. Can you explain why? (Hint: Determine the values of θ
for which $0.9 < \cos\theta < 1$ and also $0 < \cos\theta < 0.1$ in Figure 16.11.)
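
The concentration of probability near ±1 can also be seen by simulation. The MATLAB sketch below draws many random phases, evaluates the time sample $X[n_0]$ for each one, and compares the relative frequency of $|X[n_0]| > 0.9$ with the value $2\arccos(0.9)/\pi$ obtained by integrating the marginal PDF over the two intervals. The number of realizations M, the sample time n0 = 7, and the variable names are illustrative assumptions of ours.

```matlab
% Monte Carlo check of the marginal PDF of X[n] = cos(2*pi*0.1*n + Theta),
% Theta ~ U(0,2*pi), at a fixed time n0.
M = 100000;                      % number of realizations (illustrative choice)
n0 = 7;                          % fixed sample time; 7 is an arbitrary choice
theta = 2*pi*rand(M,1);          % one random phase per realization
x = cos(2*pi*0.1*n0 + theta);    % the time sample X[n0] of each realization
% Integrating 1/(pi*sqrt(1-x^2)) over 0.9 < |x| < 1 gives 2*acos(0.9)/pi
Pest = mean(double(abs(x) > 0.9));
Ptrue = 2*acos(0.9)/pi;
fprintf('estimated %.4f, theoretical %.4f\n', Pest, Ptrue);
```

Although the two intervals 0.9 < |x| < 1 make up only a tenth of the range of x, they carry a much larger share of the probability, consistent with the shape of the PDF in Figure 16.12. Changing n0 should not change the result, since the marginal PDF is the same for any n.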


Figure  16.12:  Marginal  PDF  for  randomly  phased  sinusoid. 




16.7  Joint  Moments 

The  first  and  second  moments  or  equivalently  the  mean  and  variance  of  a  random 
process  at  a  given  sample  time  are  of  great  practical  importance  since  they  are  easily 
determined.  Also,  the  covariance  between  two  samples  of  the  random  process  at  two 
different  times  is  easily  found.  At  worst,  the  first  and  second  moments  can  always 
be  estimated  in  practice.  This  is  in  contrast  to  the  joint  PMF  or  joint  PDF,  which 
in  practice  may  be  difficult  to  determine.  Hence,  we  next  define  and  give  some 
examples  of  the  mean,  variance,  and  covariance  sequences  for  a  DTCV  random 
process.  The  mean  sequence  is  defined  as 

px[n]  —  J5[AT[n]]  —  oo  <  n  <  oo  (16.5) 

while  the  variance  sequence  is  defined  as 

ax[n]  =  var {X[n])  —  oo  <  n  <  oo  (16.6) 

and  finally  the  covariance  sequence  is  defined  as 
cx[ni,n2]  =  cov(X[ni],X[n2]) 

=  BPH-mW)(XH-bW)]  -oo  <  ni<  oo  (16  7) 

— oo  <  n2  <  oo . 

The expectations for the mean and variance are taken with respect to the PMF or
PDF $p_{X[n]}$ for a particular value of $n$. Similarly, the expectation needed for the
evaluation of the covariance is with respect to the joint PMF or PDF $p_{X[n_1],X[n_2]}$
for particular values of $n_1$ and $n_2$. Since the required PMF or PDF should be clear
from the context, we henceforth do not subscript the expectation operator as we
have done previously. Note that the usual symmetry property of the covariance
holds, which results in $c_X[n_2,n_1] = c_X[n_1,n_2]$. Also, it follows from the definition
of the covariance sequence that $c_X[n,n] = \sigma_X^2[n]$. The actual evaluation of the
moments proceeds exactly as for random variables.

If the random process is a continuous-time one, then the corresponding definitions
are

$$\begin{aligned}
\mu_X(t) &= E[X(t)] \\
\sigma_X^2(t) &= \mathrm{var}(X(t)) \\
c_X(t_1,t_2) &= E[(X(t_1)-\mu_X(t_1))(X(t_2)-\mu_X(t_2))].
\end{aligned}$$

These are called the mean function, variance function, and covariance function,
respectively. We next examine the moments for the examples of the previous section.
Noting that the variance is just the covariance sequence evaluated at $n_1 = n_2 = n$,
we need only determine the mean and covariance sequences.
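
When many realizations of a process are available, the definitions above suggest estimating the moments by averaging across the ensemble at each fixed sample time. The following MATLAB sketch illustrates this for WGN, whose samples are independent with zero mean, so the estimated covariance matrix should come out close to diagonal. The variable names and the values of `M`, `N`, and `sigma2` are illustrative assumptions, not taken from the text.

```matlab
randn('state',0)             % legacy seeding, in the style used in this chapter
M=5000;                      % number of realizations (assumed value)
N=8;                         % sample times n=0,1,...,N-1 (assumed value)
sigma2=2;                    % WGN variance (assumed value)
x=sqrt(sigma2)*randn(N,M);   % column i holds realization i
muhat=mean(x,2);             % estimated mean sequence, one value per n
xc=x-repmat(muhat,1,M);      % remove the estimated mean at each n
chat=(xc*xc')/M;             % chat(n1+1,n2+1) estimates c_X[n1,n2]
varhat=diag(chat);           % variance sequence is c_X[n,n]
disp(muhat')
disp(chat)
```

The diagonal of `chat` estimates the variance sequence and the off-diagonal entries estimate the covariance between different sample times. The same few lines apply to any process once `x` is filled with its realizations.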
Example 16.9 - White Gaussian noise

Since $X[n] \sim \mathcal{N}(0,\sigma^2)$ for all $n$, we have that

$$\begin{aligned}
\mu_X[n] &= 0 \qquad -\infty < n < \infty \\
\sigma_X^2[n] &= \sigma^2 \qquad -\infty < n < \infty.
\end{aligned}$$


The covariance sequence for $n_1 \neq n_2$ must be zero since the random variables are
all independent. Recalling that the covariance between $X[n]$ and itself is just the
variance, we have that

$$c_X[n_1,n_2] = \begin{cases} 0 & n_1 \neq n_2 \\ \sigma^2 & n_1 = n_2. \end{cases}$$

This can be written in more succinct form by using the discrete delta function as

$$c_X[n_1,n_2] = \sigma^2\delta[n_2-n_1].$$

In summary, for a WGN random process we have that $\mu_X[n] = 0$ for all $n$ and
$c_X[n_1,n_2] = \sigma^2\delta[n_2-n_1]$.


Example 16.10 - Moving average random process

The mean sequence is

$$\mu_X[n] = E[X[n]] = E\left[\tfrac{1}{2}(U[n] + U[n-1])\right] = 0 \qquad -\infty < n < \infty$$

since $U[n]$ is white Gaussian noise, which has a zero mean for all $n$. To find the
covariance sequence using $X[n] = (U[n] + U[n-1])/2$, we have

$$\begin{aligned}
c_X[n_1,n_2] &= E[(X[n_1]-\mu_X[n_1])(X[n_2]-\mu_X[n_2])] \\
&= E[X[n_1]X[n_2]] \\
&= \tfrac{1}{4}E[(U[n_1]+U[n_1-1])(U[n_2]+U[n_2-1])] \\
&= \tfrac{1}{4}\left(E[U[n_1]U[n_2]] + E[U[n_1]U[n_2-1]] + E[U[n_1-1]U[n_2]] + E[U[n_1-1]U[n_2-1]]\right).
\end{aligned}$$

But $E[U[k]U[l]] = \sigma_U^2\delta[l-k]$ since $U[n]$ is WGN, and as a result

$$\begin{aligned}
c_X[n_1,n_2] &= \tfrac{1}{4}\left(\sigma_U^2\delta[n_2-n_1] + \sigma_U^2\delta[n_2-1-n_1] + \sigma_U^2\delta[n_2-n_1+1] + \sigma_U^2\delta[n_2-n_1]\right) \\
&= \frac{\sigma_U^2}{2}\delta[n_2-n_1] + \frac{\sigma_U^2}{4}\delta[n_2-n_1-1] + \frac{\sigma_U^2}{4}\delta[n_2-n_1+1].
\end{aligned}$$


Figure 16.13: Covariance sequence for moving average random process.

This is plotted in Figure 16.13 versus $\Delta n = n_2 - n_1$. It is seen that the covariance
sequence is zero unless the two samples are at most one unit apart or $\Delta n = n_2 - n_1 = \pm 1$.
Note that the covariance between any two samples spaced one unit apart is the
same. Thus, for example, $X[1]$ and $X[2]$ have the covariance $c_X[1,2] = \sigma_U^2/4$,
as do $X[9]$ and $X[10]$ since $c_X[9,10] = \sigma_U^2/4$, and as do $X[-3]$ and $X[-2]$ since
$c_X[-3,-2] = \sigma_U^2/4$ (see Figure 16.13). Any samples that are spaced more than one
unit apart are uncorrelated. This is because for $|n_2 - n_1| > 1$, $X[n_1]$ and $X[n_2]$
are independent, being composed of two sets of different WGN samples (recall that
functions of independent random variables are independent). In summary, we have
that


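$$\begin{aligned}
\mu_X[n] &= 0 \\
c_X[n_1,n_2] &= \begin{cases} \dfrac{\sigma_U^2}{2} & n_1 = n_2 \\[4pt] \dfrac{\sigma_U^2}{4} & |n_2-n_1| = 1 \\[4pt] 0 & |n_2-n_1| > 1 \end{cases}
\end{aligned}$$

and the variance is $c_X[n,n] = \sigma_U^2/2$ for all $n$. Also, note from Figure 16.13 that the
covariance sequence is symmetric about $\Delta n = 0$.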

❖ 
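
The covariance values derived in this example can be checked numerically by averaging products of samples over many realizations. The MATLAB sketch below does this for a few pairs of sample times; the number of realizations `M` and the choice $\sigma_U^2 = 1$ are assumptions made only for illustration.

```matlab
randn('state',0)
M=10000;                          % number of realizations (assumed value)
sigmaU2=1;                        % variance of U[n] (assumed value)
u=sqrt(sigmaU2)*randn(4,M);       % rows hold U[-1],U[0],U[1],U[2]
x=(u(2:4,:)+u(1:3,:))/2;          % rows hold X[0],X[1],X[2]
c00=mean(x(1,:).*x(1,:));         % estimate of c_X[0,0], theory sigmaU2/2
c01=mean(x(1,:).*x(2,:));         % estimate of c_X[0,1], theory sigmaU2/4
c02=mean(x(1,:).*x(3,:));         % estimate of c_X[0,2], theory 0
c12=mean(x(2,:).*x(3,:));         % estimate of c_X[1,2], theory sigmaU2/4
disp([c00 c01 c02 c12])
```

Because the process is zero mean, the averaged products estimate the covariances directly. The estimates for the pairs (0,1) and (1,2) should be close to each other, reflecting the dependence on spacing only.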


Example 16.11 - Randomly phased sinusoid

Recalling that the phase is uniformly distributed on $(0, 2\pi)$ we have that the mean
sequence is

$$\begin{aligned}
\mu_X[n] &= E[X[n]] = E[\cos(2\pi(0.1)n + \Theta)] \\
&= \int_0^{2\pi} \cos(2\pi(0.1)n + \theta)\frac{1}{2\pi}\,d\theta \qquad \text{(use (11.10))} \\
&= \frac{1}{2\pi}\sin(2\pi(0.1)n + \theta)\Big|_0^{2\pi} = 0
\end{aligned}$$

for all $n$. Noting that the mean sequence is zero, the covariance sequence becomes

$$\begin{aligned}
c_X[n_1,n_2] &= E[X[n_1]X[n_2]] \\
&= \int_0^{2\pi} [\cos(2\pi(0.1)n_1 + \theta)\cos(2\pi(0.1)n_2 + \theta)]\frac{1}{2\pi}\,d\theta \\
&= \int_0^{2\pi} \left[\tfrac{1}{2}\cos[2\pi(0.1)(n_2-n_1)] + \tfrac{1}{2}\cos[2\pi(0.1)(n_1+n_2) + 2\theta]\right]\frac{1}{2\pi}\,d\theta \\
&= \tfrac{1}{2}\cos[2\pi(0.1)(n_2-n_1)] + \frac{1}{8\pi}\sin[2\pi(0.1)(n_1+n_2) + 2\theta]\Big|_0^{2\pi} \\
&= \tfrac{1}{2}\cos[2\pi(0.1)(n_2-n_1)].
\end{aligned}$$

Figure 16.14: Covariance sequence for randomly phased sinusoid, plotted versus $\Delta n = n_2 - n_1$.

Figure 16.15: Fifty realizations of randomly phased sinusoid plotted in an overlaid
format with one realization shown with its points connected by straight lines.

Once again the covariance sequence depends only on the spacing between the two
samples or on $n_2 - n_1$. The covariance sequence is shown in Figure 16.14. The
reader should note the symmetry of the covariance sequence about $\Delta n = 0$. Also,
the variance follows as $\sigma_X^2[n] = c_X[n,n] = 1/2$ for all $n$. It is interesting to observe
that in this example the fact that the mean sequence is zero makes intuitive sense.
To see this we have plotted 50 realizations of the random process in an overlaid
fashion in Figure 16.15. This representation is called a scatter diagram. Also
plotted is the first realization with the values connected by straight lines for easier
viewing. The difference in the realizations is due to the different values of phase
realized. It is seen that for a given time instant the values are nearly symmetric
about zero, as is predicted by the PDF shown in Figure 16.12, and that the majority
of the values are near $\pm 1$, again in agreement with the PDF. The MATLAB code
used to generate Figure 16.15 (but omitting the solid curve) is given below.

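clear all
rand('state',0)                 % initialize random number generator
n=[0:31]';                      % sample times as a column vector
nreal=50;                       % number of realizations
for i=1:nreal
   x(:,i)=cos(2*pi*0.1*n+2*pi*rand(1,1));   % new uniform phase for each realization
end
plot(n,x(:,1),'.')              % plot first realization as dots
grid
hold on
for i=2:nreal
   plot(n,x(:,i),'.')           % overlay remaining realizations
end
axis([0 31 -1.5 1.5])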
In these three examples the covariance sequence only depends on $|n_2 - n_1|$. This is
not always the case, as is illustrated in Problem 16.26. Also, another counterexample
is the random process whose realization is shown in Figure 16.7b. This random
process has $\mathrm{var}(X[n]) = c_X[n,n]$, which is not a function of $n_2 - n_1 = n - n = 0$
since otherwise its variance would be a constant for all $n$.
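
The dependence of the covariance on spacing alone can be checked by simulation for the randomly phased sinusoid of Example 16.11. The MATLAB sketch below estimates $c_X[n_1,n_2]$ for the same spacings starting from two different values of $n_1$ and compares both with $\frac{1}{2}\cos[2\pi(0.1)(n_2-n_1)]$. The number of realizations and the two starting times ($n_1 = 0$ and $n_1 = 7$) are arbitrary choices made for illustration.

```matlab
rand('state',0)
M=20000;                             % number of realizations (assumed value)
theta=2*pi*rand(1,M);                % one uniform phase per realization
n=[0:12]';                           % sample times
x=cos(2*pi*0.1*n*ones(1,M)+ones(length(n),1)*theta);  % x(n+1,i) is X[n], realization i
dn=[0:5];                            % spacings n2-n1 to examine
chat_a=zeros(size(dn));chat_b=zeros(size(dn));
for k=1:length(dn)
   chat_a(k)=mean(x(1,:).*x(1+dn(k),:));   % n1=0, n2=dn(k)
   chat_b(k)=mean(x(8,:).*x(8+dn(k),:));   % n1=7, n2=7+dn(k)
end
ctheory=0.5*cos(2*pi*0.1*dn);        % covariance derived in Example 16.11
disp([dn' chat_a' chat_b' ctheory'])
```

The two estimated columns should agree with each other and with the theoretical column to within the simulation error, which shrinks as `M` grows.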


16.8  Real-World  Example  -  Statistical  Data  Analysis 

It was mentioned in the introduction that some meteorologists argue that the annual
summer rainfall totals are increasing due to global warming. Referring to Figure
16.1 this supposition asserts that if $X[n]$ is the annual summer rainfall total for year
$n$, then $\mu_X[n_2] > \mu_X[n_1]$ for $n_2 > n_1$. One way to attempt to confirm or dispute
this supposition is to assume that $\mu_X[n] = an + b$ and then determine if $a > 0$, as
would be the case if the mean were increasing. From the data shown in Figure 16.1
we can estimate $a$. To do so we let the year 1895, which is the beginning of our data
set, be indexed as $n = 0$ and note that $an + b$ when plotted versus $n$ is a straight
line. We estimate $a$ by fitting a straight line to the data set using a least squares
procedure [Kay 1993]. The least squares estimate chooses as estimates of $a$ and $b$
the values that minimize the least squares error


$$J(a,b) = \sum_{n=0}^{N-1} (x[n] - (an+b))^2 \qquad (16.8)$$

where $N = 108$ for our data set. This approach can be shown to be an optimal
one under the condition that the random process is actually given by $X[n] = an +
b + U[n]$, where $U[n]$ is a WGN random process [Kay 1993]. Note that if we did
not suspect that the mean rainfall totals were changing, then we might assume that
$\mu_X[n] = b$ and the least squares estimate of $b$ would result from minimizing

$$J(b) = \sum_{n=0}^{N-1} (x[n] - b)^2.$$


If we differentiate $J(b)$ with respect to $b$, set the derivative equal to zero, and solve
for $b$, we obtain (see Problem 16.32)

$$\hat{b} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$$

or $\hat{b} = \bar{x}$, where $\bar{x}$ is the sample mean, which for our data set is 9.76. Now, however,
we obtain the least squares estimates of $a$ and $b$ by differentiating (16.8) with respect
to $b$ and $a$ to yield

$$\begin{aligned}
\frac{\partial J}{\partial b} &= -2\sum_{n=0}^{N-1} (x[n] - an - b) = 0 \\
\frac{\partial J}{\partial a} &= -2\sum_{n=0}^{N-1} (x[n] - an - b)n = 0.
\end{aligned}$$

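This results in two simultaneous linear equations

$$\begin{aligned}
bN + a\sum_{n=0}^{N-1} n &= \sum_{n=0}^{N-1} x[n] \\
b\sum_{n=0}^{N-1} n + a\sum_{n=0}^{N-1} n^2 &= \sum_{n=0}^{N-1} nx[n].
\end{aligned}$$

In vector/matrix form this is

$$\begin{bmatrix} N & \sum_{n=0}^{N-1} n \\ \sum_{n=0}^{N-1} n & \sum_{n=0}^{N-1} n^2 \end{bmatrix} \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} \sum_{n=0}^{N-1} x[n] \\ \sum_{n=0}^{N-1} nx[n] \end{bmatrix} \qquad (16.9)$$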


which is easily solved to yield the estimates $\hat{b}$ and $\hat{a}$. For the data of Figure 16.1
the estimates are $\hat{a} = 0.0173$ and $\hat{b} = 8.8336$. The data along with the estimated
mean sequence $\hat{\mu}_X[n] = 0.0173n + 8.8336$ are shown in Figure 16.16. Note that the
mean indeed appears to be increasing with time. The least squares error sequence,
which is defined as $e[n] = x[n] - (\hat{a}n + \hat{b})$, is shown in Figure 16.17. It is sometimes
referred to as the fitting error. Note that the error can be quite large. In fact, we
have that $(1/N)\sum_{n=0}^{N-1} e^2[n] = 10.05$.

Figure 16.16: Annual summer rainfall in Rhode Island and the estimated mean
sequence, $\hat{\mu}_X[n] = 0.0173n + 8.8336$, where $n = 0$ corresponds to the year 1895.

Figure 16.17: Least squares error sequence for annual summer rainfall in Rhode
Island fitted with a straight line.
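
The straight-line fit described by (16.9) takes only a few lines of MATLAB. Since the rainfall record itself is not reproduced here, the sketch below runs on synthetic data generated from an assumed line plus WGN; the values `atrue`, `btrue`, and the noise variance of 10 are hypothetical and chosen only so that the example runs. The call to `polyfit` serves as an independent cross-check of the solution.

```matlab
randn('state',0)
N=108;                               % number of yearly samples, as in the text
n=[0:N-1]';
atrue=0.02;btrue=9;                  % assumed values for the synthetic data
x=atrue*n+btrue+sqrt(10)*randn(N,1); % synthetic data, NOT the rainfall record
A=[N sum(n);sum(n) sum(n.^2)];       % matrix on left side of (16.9)
r=[sum(x);sum(n.*x)];                % right side of (16.9)
baest=A\r;                           % solve for [best;aest] without forming inv(A)
best=baest(1);aest=baest(2);
e=x-(aest*n+best);                   % least squares error sequence
Jmin=sum(e.^2)/N;                    % average squared fitting error
p=polyfit(n,x,1);                    % cross-check: p(1) is slope, p(2) is intercept
disp([aest best Jmin])
```

Solving with the backslash operator rather than an explicit inverse is a common numerical preference; for a 2 x 2 system of this kind either approach gives essentially the same answer.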

Now the real question is whether the estimated mean increase in rainfall is
significant. The increase is $\hat{a} = 0.0173$ per year for a total increase of about 1.85
inches over the course of 108 years. Is it possible that the true mean rainfall has
not changed, or that it is really $\mu_X[n] = b$ with the true value of $a$ being zero?
In effect, is the value of $\hat{a} = 0.0173$ only due to estimation error? One way to
answer this question is to hypothesize that $a = 0$ and then determine the probability
density function of $\hat{a}$ as obtained from (16.9). This can be done analytically by
assuming $X[n] = b + U[n]$, where $U[n]$ is white Gaussian noise (see Problem 16.33).
However, we can gain some quick insight into the problem by resorting to a computer
simulation. To do so we assume that the true model for the rainfall data is $X[n] =
b + U[n] = 9.76 + U[n]$, where $U[n]$ is white Gaussian noise with variance $\sigma^2$. Since
we do not know the value of $\sigma^2$, we estimate it by using the results shown in Figure
16.17. The least squares error sequence $e[n]$, which is the original data with its
estimated mean sequence subtracted, should then be an estimate of $U[n]$. Therefore,
we use $\hat{\sigma}^2 = (1/N)\sum_{n=0}^{N-1} e^2[n] = 10.05$ in our simulation. In summary, we generate
20 realizations of the random process $X[n] = 9.76 + U[n]$, where $U[n]$ is WGN with
$\sigma^2 = 10.05$. Then, we use (16.9) to estimate $a$ and $b$ and finally we plot our mean
sequence estimate, which is $\hat{\mu}_X[n] = \hat{a}n + \hat{b}$ for each realization. Using the MATLAB
code shown at the end of this section, the results are shown in Figure 16.18. It is
seen that even though the true value of $a$ is zero, the estimated value will take on
nonzero values with a high probability. Since some of the lines are decreasing, some
of the estimated values of $a$ are even negative. Hence, we would be hard pressed to
say that the mean rainfall totals are indeed increasing. Such is the quandary that
scientists must deal with on an everyday basis. The only way out of this dilemma is
to accumulate more data so that hopefully our estimate of $a$ will be more accurate
(see also Problem 16.34).

Figure 16.18: Twenty realizations of the estimated mean sequence $\hat{\mu}_X[n] = \hat{a}n + \hat{b}$
based on the random process $X[n] = 9.76 + U[n]$ with $U[n]$ being WGN with $\sigma^2 =
10.05$. The realizations are shown as dashed lines. The estimated mean sequence
from Figure 16.16 is shown as the solid line.

clear all
randn('state',0)
years=[1895:2002]';
N=length(years);
n=[0:N-1]';
A=[N sum(n);sum(n) sum(n.^2)];   % precompute matrix (see (16.9))
B=inv(A);                        % invert matrix
for i=1:20
   xn=9.76+sqrt(10.05)*randn(N,1);   % generate realizations
   baest=B*[sum(xn);sum(n.*xn)];     % estimate a and b using (16.9)
   aest=baest(2);best=baest(1);
   meanest(:,i)=aest*n+best;         % determine mean sequence estimate
end
figure   % plot mean sequence estimates and overlay
plot(n,meanest(:,1))
grid
xlabel('n')
ylabel('Estimated mean')
axis([0 107 5 15])
hold on
for i=2:20
   plot(n,meanest(:,i))
end
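
Twenty realizations give only a qualitative picture of how much the slope estimate scatters. As a rough quantitative companion, the sketch below simulates many more data sets under the same $a = 0$ model and counts how often the estimated slope exceeds 0.0173. It also evaluates the Gaussian result stated in Problem 16.33, namely that the slope estimate has variance equal to the (2,2) element of $\sigma^2(\mathbf{H}^T\mathbf{H})^{-1}$; using that result here, the number of simulated data sets `M`, and the variable names are assumptions of this sketch rather than part of the original listing.

```matlab
randn('state',0)
N=108;sigma2=10.05;                       % values used in the text
n=[0:N-1]';
H=[ones(N,1) n];                          % N x 2 matrix with rows [1 n]
C=sigma2*inv(H'*H);                       % covariance of [best aest]' when a=0
vara=C(2,2);                              % variance of the slope estimate
Ptheory=0.5*erfc((0.0173/sqrt(vara))/sqrt(2));  % P[aest>0.0173] for Gaussian aest
M=20000;                                  % number of simulated data sets (assumed)
x=9.76+sqrt(sigma2)*randn(N,M);           % each column is one realization
baest=(H'*H)\(H'*x);                      % row 1: best, row 2: aest
Psim=mean(baest(2,:)>0.0173);             % fraction of slopes exceeding 0.0173
disp([sqrt(vara) Ptheory Psim])
```

If the simulated fraction and the Gaussian tail probability agree, that supports reading the observed slope against the standard deviation `sqrt(vara)` of the estimate when judging whether it could be due to estimation error alone.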
Problems

16.1  (^)  (w)  Describe  a  random  process  that  you  are  likely  to  encounter  in  the 
following  situations: 

a.  listening  to  the  daily  weather  forecast 

b.  paying  the  monthly  telephone  bill 

c.  leaving  for  work  in  the  morning 

Why  is  each  process  a  random  one? 

16.2 (w) A single die is tossed repeatedly. What are $\mathcal{S}$ and $\mathcal{S}_X$? Also, can you
determine the joint PMF for any $N$ sample times?

16.3 (t) An infinite sequence of 0's and 1's, denoted as $b_1, b_2, \ldots$, can be used to
represent any number $x$ in the interval $[0,1]$ using the binary representation
formula

$$x = \sum_{i=1}^{\infty} b_i 2^{-i}.$$

For example, we can represent 3/4 as $0.b_1b_2\ldots = 0.11000\ldots$ and 1/16 as
$0.b_1b_2\ldots = 0.0001000\ldots$. Find the representations for 7/8 and 5/8. Is the
total number of infinite sequences of 0's and 1's countable?

16.4 (o) (w) For a Bernoulli random process determine the probability that we
will observe an alternating sequence of 1's and 0's for the first 100 samples
with the first sample being a 1. What is the probability that we will observe
an alternating sequence of 1's and 0's for all $n$?
16.5 (w) Classify the following random processes as either DTDV, DTCV, CTDV,
or  CTCV: 

a.  temperature  in  Rhode  Island 

b.  outcomes  for  continued  spins  of  a  roulette  wheel 

c.  daily  weight  of  person 

d.  number  of  cars  stopped  at  an  intersection 

16.6  (c)  Simulate  a  realization  of  the  random  walk  process  described  in  Example 
16.2  on  a  computer.  What  happens  as  n  becomes  large? 

16.7(0)(C,f)  A  biased  random  walk  process  is  defined  as  X[n]  =  where 

U[i\  is  a  Bernoulli  random  process  with 


What is $E[X[n]]$ and $\mathrm{var}(X[n])$ as a function of $n$? Next, simulate on a
computer a realization of this random process. What happens as $n \to \infty$ and
why?

16.8 (w) A random process $X[n]$ is stationary. If it is known that $E[X[10]] = 10$
and $\mathrm{var}(X[10]) = 1$, then determine $E[X[100]]$ and $\mathrm{var}(X[100])$.

16.9 (^) (f) The IID random process $X[n]$ has the marginal PDF
$p_X(x) = \exp(-x)u(x)$. What is the probability that $X[0], X[1], X[2]$ will all
be greater than 1?

16.10 (w) If an IID random process $X[n]$ is transformed to the random process
$Y[n] = X^2[n]$, is the transformed random process also IID?

16.11 (w) A Bernoulli random process $X[n]$ that takes on values 0 or 1, each with
probability of $p = 1/2$, is transformed using $Y[n] = (-1)^n X[n]$. Is the random
process $Y[n]$ IID?

16.12  (w,f)  A  nonstationary  random  process  is  defined  as  X[n\  =  a^U[n],  where 
$0 < a < 1$ and $U[n]$ is WGN with variance $\sigma_U^2$. Find the mean and covariance
sequences of $X[n]$. Can you transform the $X[n]$ random process to make it
stationary? 

16.13 (^) (w) Consider the random process $X[n] = \sum_{i=0}^{n} U[i]$, which is defined
for $n \geq 0$. The $U[n]$ random process consists of independent Gaussian random
variables with marginal PDF $U[n] \sim \mathcal{N}(0, (1/2)^n)$. Are the increments
independent? Are the increments stationary?
16.14 (c) Plot 50 realizations of a WGN random process $X[n]$ with $\sigma^2 = 1$ for
$n = 0, 1, \ldots, 49$ using a scatter diagram (see Figure 16.15 for an example). Use
the MATLAB commands plot(x,y,'.') and hold on to plot each realization
as dots and to overlay the realizations on the same graph, respectively. For a
fixed $n$ can you explain the observed distribution of the dots?

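16.15 (f) Prove that

$$p_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{N/2}\det^{1/2}(\mathbf{C})}\exp\left(-\frac{1}{2}\mathbf{x}^T\mathbf{C}^{-1}\mathbf{x}\right)$$

where $\mathbf{x} = [x_1 \ldots x_N]^T$ and $\mathbf{C} = \sigma^2\mathbf{I}$ for $\mathbf{I}$ an $N \times N$ identity matrix, reduces
to (16.4).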

16.16 (0)(f) A "white" uniform random process is defined to be an IID random
process with $X[n] \sim \mathcal{U}(-\sqrt{3}, \sqrt{3})$ for all $n$. Determine the mean and covariance
sequences for this random process and compare them to those of the
WGN random process. Explain your results.

16.17 (w) A moving average random process can be defined more generally as one
for which $N$ samples of WGN are averaged, instead of only $N = 2$ samples as
in Example 16.7. It is given by $X[n] = (1/N)\sum_{i=0}^{N-1} U[n-i]$ for all $n$, where
$U[n]$ is a WGN random process with variance $\sigma_U^2$. Determine the correlation
coefficient for $X[0]$ and $X[1]$. What happens as $N$ increases?

16.18 (^) (f) For the moving average random process defined in Example 16.7
determine $P[X[n] > 3]$ and compare it to $P[U[n] > 3]$. Explain the difference
in terms of "smoothing". Assume that $\sigma_U^2 = 1$.

16.19  (c)  For  the  randomly  phased  sinusoid  defined  in  Example  16.8  determine  the 
mean  sequence  using  a  computer  simulation. 

16.20 (t) For the randomly phased sinusoid of Example 16.8 assume that the realization
$x[n] = \cos(2\pi(0.1)n + \theta)$ is generated. Prove that if we observe only the
samples $x[0] = 1$ and $x[1] = \cos(2\pi(0.1)) = 0.8090$, then all the future samples
can be found by using the recursive formula $x[n] = 2\cos(2\pi(0.1))x[n-1] -
x[n-2]$ for $n \geq 2$. Could you also find the past samples or $x[n]$ for $n \leq -1$?
See also Problem 18.25 for prediction of a sinusoidal random process.

16.21  (c)  Verify  the  PDF  of  the  randomly  phased  sinusoid  given  in  Figure  16.12 
by  using  a  computer  simulation. 

16.22 (0) (f,c) A continuous-time random process known as the random amplitude
sinusoid is defined as $X(t) = A\cos(2\pi t)$ for $-\infty < t < \infty$ and
$A \sim \mathcal{N}(0,1)$. Find the mean and covariance functions. Then, plot some
realizations of $X(t)$ in an overlaid fashion.

16.23 (f) A random process is the sum of WGN and a deterministic sinusoid and is
given as $X[n] = U[n] + \sin(2\pi f_0 n)$ for all $n$, where $U[n]$ is WGN with variance
$\sigma_U^2$. Determine the mean and covariance sequences.

16.24  (o)  (w)  A  random  process  is  IID  with  samples  X[n]  ~  1).  It  is  desired 

to remove the mean of the random process by forming the new random process
$Y[n] = X[n] - X[n-1]$. First determine the mean sequence of $Y[n]$. Next find
$\mathrm{cov}(Y[0], Y[1])$. Is $Y[n]$ an IID random process with a zero mean sequence?

16.25 (f) If a random process is defined as $X[n] = h[0]U[n] + h[1]U[n-1]$, where $h[0]$
and $h[1]$ are constants and $U[n]$ is WGN with variance $\sigma_U^2$, find the covariance
for $X[0]$ and $X[1]$. Repeat for $X[9]$ and $X[10]$. How do they compare?

16.26 (^) (f) If a sum random process is defined as $X[n] = \sum_{i=0}^{n} U[i]$ for $n \geq 0$,
where $E[U[i]] = 0$ and $\mathrm{var}(U[i]) = \sigma_U^2$ for $i \geq 0$ and the $U[i]$ are IID, find the
mean and covariance sequences of $X[n]$.

16.27 (^) (c) For the MA random process defined in Example 16.7 find $c_X[1,1]$,
$c_X[1,2]$ and $c_X[1,3]$ if $\sigma_U^2 = 1$. Next simulate on a computer $M = 10{,}000$
realizations of the random process $X[n]$ for $n = 0, 1, \ldots, 10$. Estimate the previous
covariance sequence samples using $\hat{c}_X[n_1,n_2] = (1/M)\sum_{i=1}^{M} x_i[n_1]x_i[n_2]$,
where $x_i[n]$ is the $i$th realization of $X[n]$. Note that since $X[n]$ is zero mean,
$c_X[n_1,n_2] = E[X[n_1]X[n_2]]$.

16.28 (w) For the randomly phased sinusoid described in Example 16.11 determine
the minimum mean square estimate of $X[10]$ based on observing $X[0]$. How
accurate do you think this prediction will be?

16.29 (f) For a random process $X[n]$ the mean sequence $\mu_X[n]$ and covariance
sequence $c_X[n_1,n_2]$ are known. It is desired to predict $k$ samples into the
future. If $x[n_0]$ is observed, find the minimum mean square estimate of $X[n_0 +
k]$. Next assume that $\mu_X[n] = \cos(2\pi f_0 n)$ and $c_X[n_1,n_2] = 0.9^{|n_2-n_1|}$ and
evaluate the estimate. Finally, what happens to your prediction as $k \to \infty$
and why?

16.30 (f) A random process is defined as $X[n] = As[n]$ for all $n$, where $A \sim \mathcal{N}(0,1)$
and $s[n]$ is a deterministic signal. Find the mean and covariance sequences.

16.31  (^)  (f)  A  random  process  is  defined  as  X[n]  =  AU[n]  for  all  n,  where  A  ~ 

A/*(0,  )  and  U[n]  is  WGN  with  variance  af-,  and  A  is  independent  of  U[n]  for 

all $n$. Find the mean and covariance sequences. What type of random process
is $X[n]$?

16.32 (f) Verify that by differentiating $\sum_{n=0}^{N-1}(x[n] - b)^2$ with respect to $b$, setting
the derivative equal to zero, and solving for $b$, we obtain the sample mean.
16.33 (t) In this problem we show how to obtain the variance of $\hat{a}$ as obtained
by solving (16.9). The variance of $\hat{a}$ is derived under the assumption that
$X[n] = b + U[n]$, where $U[n]$ is WGN with variance $\sigma^2$. This says that we
assume the true value of $a$ is zero. The steps are as follows:

a. Let

$$\mathbf{H} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & N-1 \end{bmatrix} \qquad \mathbf{X} = \begin{bmatrix} X[0] \\ X[1] \\ X[2] \\ \vdots \\ X[N-1] \end{bmatrix}$$

where $\mathbf{H}$ is an $N \times 2$ matrix and $\mathbf{X}$ is an $N \times 1$ random vector. Now
show that the equations of (16.9) can be written as

$$\mathbf{H}^T\mathbf{H}\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix} = \mathbf{H}^T\mathbf{X}.$$

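b. The solution for $\hat{b}$ and $\hat{a}$ can now be written symbolically as

$$\begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix} = \underbrace{(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T}_{\mathbf{G}}\mathbf{X}.$$

Since $\mathbf{X}$ is a Gaussian random vector, show that $[\hat{b}\ \hat{a}]^T$ is also a Gaussian
random vector with mean $[b\ 0]^T$ and covariance matrix $\sigma^2(\mathbf{H}^T\mathbf{H})^{-1}$.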

c. As a result we can assert that the marginal PDF of $\hat{a}$ is Gaussian with
mean zero and variance equal to the (2,2) element of $\sigma^2(\mathbf{H}^T\mathbf{H})^{-1}$. Show
then that $\hat{a} \sim \mathcal{N}(0, \mathrm{var}(\hat{a}))$, where

$$\mathrm{var}(\hat{a}) = \frac{\sigma^2}{\sum_{n=0}^{N-1} n^2 - \frac{1}{N}\left(\sum_{n=0}^{N-1} n\right)^2}.$$

Next assume that $\sigma^2 = 10.05$, $N = 108$ and find the probability that $\hat{a} >
0.0173$. Can we assert that the estimated mean sequence shown in Figure
16.16 is not just due to estimation error?

16.34 (^) (f) Using the results of Problem 16.33 determine the required value of
$N$ so that the probability that $\hat{a} > 0.0173$ is less than $10^{-6}$.


Chapter  17 


Wide  Sense  Stationary  Random 
Processes 

17.1  Introduction 

Having  introduced  the  concept  of  a  random  process  in  the  previous  chapter,  we 
now  wish  to  explore  an  important  subclass  of  stationary  random  processes.  This  is 
motivated  by  the  very  restrictive  nature  of  the  stationarity  condition,  which  although 
mathematically  expedient,  is  almost  never  satisfied  in  practice.  A  somewhat  weaker 
type  of  stationarity  is  based  on  requiring  the  mean  to  be  a  constant  in  time  and 
the  covariance  sequence  to  depend  only  on  the  separation  in  time  between  the  two 
samples.  We  have  already  encountered  these  types  of  random  processes  in  Examples 
16.9-16.11.  Such  a  random  process  is  said  to  be  stationary  in  the  wide  sense  or  wide 
sense  stationary  (WSS).  It  is  also  termed  a  weakly  stationary  random  process  to 
distinguish  it  from  a  stationary  process,  which  is  said  to  be  strictly  stationary.  We 
will  use  the  former  terminology  to  refer  to  such  a  process  as  a  WSS  random  process. 
In  addition,  as  we  will  see  in  Chapter  19,  if  the  random  process  is  Gaussian,  then 
wide  sense  stationarity  implies  stationarity.  For  this  reason  alone,  it  makes  sense 
to  explore  WSS  random  processes  since  the  use  of  Gaussian  random  processes  for 
modeling  is  ubiquitous. 

Once  we  have  discussed  the  concept  of  a  WSS  random  process,  we  will  be  able 
to  define  an  extremely  important  measure  of  the  WSS  random  process — the  power 
spectral  density  (PSD).  This  function  extends  the  idea  of  analyzing  the  behavior  of  a 
deterministic  signal  by  decomposing  it  into  a  sum  of  sinusoids  of  different  frequencies 
to  that  of  a  random  process.  The  difference  now  is  that  the  amplitudes  and  phases 
of  the  sinusoids  will  be  random  variables  and  so  it  will  be  convenient  to  quantify  the 
average  power  of  the  various  sinusoids.  This  description  of  a  random  phenomenon 
is  important  in  nearly  every  scientific  field  that  is  concerned  with  the  analysis  of 
time  series  data  such  as  systems  control  [Box  and  Jenkins  1970],  signal  processing 
[Schwartz  and  Shaw  1975],  economics  [Harvey  1989],  geophysics  [Robinson  1967], 
vibration testing [McConnell 1995], financial analysis [Taylor 1986], and others. As
an  example,  in  Figure  17.1  the  Wolfer  sunspot  data  [Tong  1990]  is  shown,  with  the 
data  points  connected  by  straight  lines  for  easier  viewing.  It  measures  the  average 
number  of  sunspots  visually  observed  through  a  telescope  each  year.  The  importance 
of  the  sunspot  number  is  that  as  it  increases,  an  increase  in  solar  flares  occurs.  This 
has  the  effect  of  disrupting  all  radio  communications  as  the  solar  flare  particles  reach 
the  earth.  Clearly  from  the  data  we  see  a  periodic  type  property.  The  estimated 
PSD  of  this  data  set  is  shown  in  Figure  17.2.  We  see  that  the  distribution  of  power 
versus  frequency  is  highest  at  a  frequency  of  about  0.09  cycles  per  year.  This  means 
that  the  random  process  exhibits  a  large  periodic  component  with  a  period  of  about 
$1/0.09 \approx 11$ years per cycle, as is also evident from Figure 17.1. This is a powerful
prediction  tool  and  therefore  is  of  great  interest.  How  the  PSD  is  actually  estimated 
will  be  discussed  in  this  chapter,  but  before  doing  so,  we  will  need  to  lay  some 
groundwork. 


Figure  17.1:  Annual  number  of  sunspots  -  Wolfer  sunspot  data. 


Figure 17.2: Estimated power spectral density for Wolfer sunspot data of Figure
17.1. The sample mean has been computed and removed from the data prior to
estimation of the PSD.


17.2 Summary

A less restrictive form of stationarity, termed wide sense stationarity, is defined by
(17.4) and (17.5). The conditions require the mean to be the same for all $n$ and the
covariance sequence to depend only on the time difference between the samples. A
random process that is stationary is also wide sense stationary as shown in Section
17.3. The autocorrelation sequence is defined by (17.9) with $n$ being arbitrary. It
is the covariance between two samples separated by $k$ units for a zero mean WSS
random process. Some of its properties are summarized by Properties 17.1-17.4.
Under certain conditions the mean of a WSS random process can be found by using
the temporal average of (17.25). Such a process is said to be ergodic in the mean. For
this  to  be  true  the  variance  of  the  temporal  average  given  by  (17.28)  must  converge 
to  zero  as  the  number  of  samples  averaged  becomes  large.  The  power  spectral 
density  (PSD)  of  a  WSS  random  process  is  defined  by  (17.30)  and  can  be  evaluated 
more  simply  using  (17.34).  The  latter  relationship  says  that  the  PSD  is  the  Fourier 
transform  of  the  autocorrelation  sequence.  It  measures  the  amount  of  average  power 
per  unit  frequency  or  the  distribution  of  average  power  with  frequency.  Some  of  its 
properties  are  summarized  in  Properties  17.7-17.12.  From  a  finite  segment  of  a 
realization  of  the  random  process  the  autocorrelation  sequence  can  be  estimated 
using  (17.43)  and  the  PSD  can  be  estimated  by  using  the  averaged  periodogram 
estimate  of  (17.44)  and  (17.45).  The  analogous  definitions  for  a  continuous-time 
WSS  random  process  are  given  in  Section  17.8.  Also,  an  important  example  is 
described that relates sampled continuous-time white Gaussian noise to discrete-time
white Gaussian noise. Finally, an application of the use of PSDs to random
vibration  testing  is  given  in  Section  17.9. 


17.3  Definition  of  WSS  Random  Process 

Consider a discrete-time random process $X[n]$, which is defined for $-\infty < n < \infty$
with $n$ an integer. Previously, we defined the mean and covariance sequences of
$X[n]$ to be

$$\mu_X[n] = E[X[n]] \qquad -\infty < n < \infty \tag{17.1}$$

$$c_X[n_1,n_2] = E[(X[n_1]-\mu_X[n_1])(X[n_2]-\mu_X[n_2])] \tag{17.2}$$

where $n_1, n_2$ are integers. Having knowledge of these sequences allows us to assess
important characteristics of the random process such as the mean level and the
correlation between samples. In fact, based on only this information we are able to
predict $X[n_2]$ based on observing $X[n_1] = x[n_1]$ as

$$\hat{X}[n_2] = \mu_X[n_2] + \frac{c_X[n_1,n_2]}{c_X[n_1,n_1]}\,(x[n_1] - \mu_X[n_1]) \tag{17.3}$$

which is just the usual linear prediction formula of (7.41) with $x$ replaced by $x[n_1]$
and $Y$ replaced by $X[n_2]$, and which makes use of the mean and covariance sequences
defined in (17.1) and (17.2), respectively. However, since in general the mean and
covariance change with time, i.e., they are nonstationary, it would be exceedingly
difficult to estimate them in practice. To extend the practical utility we would like
the mean not to depend on time and the covariance only to depend on the separation
between samples or on $|n_2 - n_1|$. This will allow us to estimate these quantities as
described later. Thus, we are led to a weaker form of stationarity known as wide
sense stationarity. A random process is defined to be WSS if

$$\mu_X[n] = \mu \quad \text{(a constant)} \qquad -\infty < n < \infty \tag{17.4}$$

$$c_X[n_1,n_2] = g(|n_2-n_1|) \qquad -\infty < n_1 < \infty,\ -\infty < n_2 < \infty \tag{17.5}$$

for some function $g$. Note that since

$$c_X[n_1,n_2] = E[X[n_1]X[n_2]] - E[X[n_1]]E[X[n_2]]$$

these conditions are equivalent to requiring that $X[n]$ satisfy

$$E[X[n]] = \mu \qquad -\infty < n < \infty$$

$$E[X[n_1]X[n_2]] = h(|n_2-n_1|) \qquad -\infty < n_1 < \infty,\ -\infty < n_2 < \infty$$

for some function $h$. The mean should not depend on time and the average value
of the product of two samples should depend only upon the time interval between
the samples. Some examples of WSS random processes have already been given in
Examples 16.9-16.11. For the MA process of Example 16.10 we showed that

$$\mu_X[n] = 0 \qquad -\infty < n < \infty$$

$$c_X[n_1,n_2] = \begin{cases} \dfrac{\sigma_U^2}{2} & |n_2-n_1| = 0 \\[4pt] \dfrac{\sigma_U^2}{4} & |n_2-n_1| = 1 \\[4pt] 0 & |n_2-n_1| > 1. \end{cases}$$

It is seen that every random variable $X[n]$ for $-\infty < n < \infty$ has a mean of zero
and the covariance for two samples depends only on the time interval between the
samples, which is $|n_2 - n_1|$. Also, this implies that the variance does not depend
on time since $\mathrm{var}(X[n]) = c_X[n,n] = \sigma_U^2/2$ for $-\infty < n < \infty$. In contrast to this
behavior consider the random processes for which typical realizations are shown in
Figure 16.7. In Figure 16.7a the mean changes with time (with the variance being
constant) and in Figure 16.7b the variance changes with time (with the mean being
constant). Clearly, these random processes are not WSS.
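The two WSS conditions can be checked numerically by averaging down an ensemble of realizations. The following Python sketch is only an illustration: it assumes the MA process has the form $X[n] = (U[n] + U[n-1])/2$ with zero mean Gaussian $U[n]$ (consistent with the covariance values quoted above), and the function names, seed, and ensemble size are hypothetical choices. The estimated covariance should depend only on $|n_2 - n_1|$ and not on where the pair of samples sits in time.

```python
import random

def ma_realization(num_samples, sigma_u=1.0, rng=None):
    """One realization of X[n] = (U[n] + U[n-1]) / 2 with zero mean Gaussian U[n]."""
    if rng is None:
        rng = random.Random()
    u = [rng.gauss(0.0, sigma_u) for _ in range(num_samples + 1)]
    return [0.5 * (u[n] + u[n - 1]) for n in range(1, num_samples + 1)]

def ensemble_mean(realizations, n):
    """Average of the samples at time n over all realizations."""
    return sum(x[n] for x in realizations) / len(realizations)

def ensemble_covariance(realizations, n1, n2):
    """Estimate c_X[n1, n2] by averaging down the ensemble."""
    m1 = ensemble_mean(realizations, n1)
    m2 = ensemble_mean(realizations, n2)
    return sum((x[n1] - m1) * (x[n2] - m2) for x in realizations) / len(realizations)

if __name__ == "__main__":
    rng = random.Random(7)
    ensemble = [ma_realization(12, rng=rng) for _ in range(20000)]
    # Pairs with the same separation should give nearly the same estimate.
    for n1, n2 in [(3, 3), (10, 10), (3, 4), (10, 11), (3, 6)]:
        print((n1, n2), round(ensemble_covariance(ensemble, n1, n2), 3))
```

With $\sigma_U^2 = 1$ the estimates should cluster near 1/2 for equal indices, near 1/4 for adjacent samples, and near 0 for larger separations.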

A stationary random process is also WSS, so that WSS is the weaker condition. To see
this recall that if $X[n]$ is stationary, then from (16.3) with $N = 1$ and $n_1 = n$, we
have

$$p_{X[n+n_0]} = p_{X[n]} \qquad \text{for all } n \text{ and for all } n_0.$$

As a consequence, if we let $n = 0$, then

$$p_{X[n_0]} = p_{X[0]} \qquad \text{for all } n_0$$

and since the PDF does not depend on the particular time $n_0$, the mean must not
depend on time. Thus,

$$\mu_X[n] = \mu \qquad -\infty < n < \infty. \tag{17.6}$$

Next, using (16.3) with $N = 2$, we have

$$p_{X[n_1+n_0],X[n_2+n_0]} = p_{X[n_1],X[n_2]} \qquad \text{for all } n_1, n_2 \text{ and } n_0. \tag{17.7}$$

Now if $n_0 = -n_1$ we have from (17.7)

$$p_{X[0],X[n_2-n_1]} = p_{X[n_1],X[n_2]}$$

and if $n_0 = -n_2$, we have

$$p_{X[n_1-n_2],X[0]} = p_{X[n_1],X[n_2]}.$$

This results in

$$p_{X[n_1],X[n_2]} = p_{X[0],X[n_2-n_1]}$$

$$p_{X[n_1],X[n_2]} = p_{X[n_1-n_2],X[0]}$$

which leads to

$$E[X[n_1]X[n_2]] = E[X[0]X[n_2-n_1]]$$

$$E[X[n_1]X[n_2]] = E[X[n_1-n_2]X[0]] = E[X[0]X[n_1-n_2]].$$

Finally, these two conditions combine to give

$$E[X[n_1]X[n_2]] = E[X[0]X[|n_2-n_1|]] \tag{17.8}$$

which along with the mean being constant with time yields the second condition for
wide sense stationarity of (17.5) that

$$c_X[n_1,n_2] = E[X[n_1]X[n_2]] - E[X[n_1]]E[X[n_2]] = E[X[0]X[|n_2-n_1|]] - \mu^2.$$

This proves the assertion that a stationary random process is WSS but the converse
is not generally true (see Problem 17.5).

17.4  Autocorrelation  Sequence 

If $X[n]$ is WSS, then as we have seen $E[X[n_1]X[n_2]]$ depends only on the separation
in time between the samples. We can therefore define a new joint moment by letting
$n_1 = n$ and $n_2 = n + k$ to yield

$$r_X[k] = E[X[n]X[n+k]] \tag{17.9}$$

which is called the autocorrelation sequence (ACS). It depends only on the time
difference between samples which is $|n_2 - n_1| = |(n+k) - n| = |k|$ so that the value
of $n$ used in the definition is arbitrary. It is termed the autocorrelation sequence
(ACS) since it measures the correlation between two samples of the same random
process. Later we will have occasion to define correlation between two different
random processes (see Section 19.3). Note that the time interval between samples
is also called the lag. An example of the computation of the ACS is given next.

Example 17.1 - A Differencer

Define a random process as $X[n] = U[n] - U[n-1]$, where $U[n]$ is an IID random
process with mean $\mu$ and variance $\sigma_U^2$. A realization of this random process for which
$U[n]$ is a Gaussian random variable for all $n$ is shown in Figure 17.3. Although
$U[n]$ was chosen here to be a sequence of Gaussian random variables for the sake
of displaying the realization in Figure 17.3, the ACS to be found will be the same
regardless of the PDF of $U[n]$. This is because it relies on only the first two moments
of $U[n]$ and not its PDF. The ACS is found as

$$\begin{aligned}
r_X[k] &= E[X[n]X[n+k]] \\
&= E[(U[n]-U[n-1])(U[n+k]-U[n+k-1])] \\
&= E[U[n]U[n+k]] - E[U[n]U[n+k-1]] \\
&\quad - E[U[n-1]U[n+k]] + E[U[n-1]U[n+k-1]].
\end{aligned}$$

But for $n_1 \neq n_2$

$$E[U[n_1]U[n_2]] = E[U[n_1]]E[U[n_2]] = \mu^2 \qquad \text{(independence)}$$

and for $n_1 = n_2 = n$

$$E[U[n_1]U[n_2]] = E[U^2[n]] = E[U^2[0]] = \sigma_U^2 + \mu^2 \qquad \text{(identically distributed).}$$

Figure 17.3: Typical realization of a differenced IID Gaussian random process with
$U[n] \sim \mathcal{N}(1,1)$.

Combining these results we have that

$$E[U[n_1]U[n_2]] = \mu^2 + \sigma_U^2\,\delta[n_2-n_1]$$

and therefore the ACS becomes

$$r_X[k] = 2\sigma_U^2\,\delta[k] - \sigma_U^2\,\delta[k-1] - \sigma_U^2\,\delta[k+1]. \tag{17.10}$$

This is shown in Figure 17.4. Several observations can be made. The only nonzero
correlation is between adjacent samples and this correlation is negative. This
accounts for the observation that the realization shown in Figure 17.3 exhibits many
adjacent samples that are opposite in sign. Some other observations are that
$r_X[0] > 0$, $|r_X[k]| \le r_X[0]$ for all $k$, and finally $r_X[-k] = r_X[k]$. In words, the
ACS has a maximum at $k = 0$, which is positive, and is a symmetric sequence about
$k = 0$ (also called an even sequence). These properties hold in general as we now
prove.

❖
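As a numerical check of (17.10), the following minimal Python sketch simulates the differencer with Gaussian $U[n] \sim \mathcal{N}(1,1)$, as in Figure 17.3, and estimates $r_X[k]$ by averaging lag products over time. The function names, the seed, and the record length are assumptions made for illustration. Replacing the ensemble average in (17.9) by a time average is also an assumption here, which relies on the ergodicity ideas of Section 17.5.

```python
import random

def differencer_realization(num_samples, mu=1.0, sigma_u=1.0, seed=0):
    """Generate X[n] = U[n] - U[n-1] for IID Gaussian U[n] with mean mu."""
    rng = random.Random(seed)
    u = [rng.gauss(mu, sigma_u) for _ in range(num_samples + 1)]
    return [u[n] - u[n - 1] for n in range(1, num_samples + 1)]

def estimate_acs(x, max_lag):
    """Estimate r_X[k], k = 0..max_lag, by averaging x[n] * x[n+k] over n."""
    total = len(x)
    return [sum(x[n] * x[n + k] for n in range(total - k)) / (total - k)
            for k in range(max_lag + 1)]

def theoretical_acs(k, sigma_u=1.0):
    """Equation (17.10): 2 s^2 delta[k] - s^2 delta[k-1] - s^2 delta[k+1]."""
    var_u = sigma_u ** 2
    if k == 0:
        return 2.0 * var_u
    if abs(k) == 1:
        return -var_u
    return 0.0

if __name__ == "__main__":
    x = differencer_realization(50000)
    for k, r_hat in enumerate(estimate_acs(x, 3)):
        print(k, round(r_hat, 3), theoretical_acs(k))
```

Note that the mean of $U[n]$ does not appear in the result, since it cancels in the difference.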


Property 17.1 - ACS is positive for the zero lag or $r_X[0] > 0$.

Proof:

$$r_X[k] = E[X[n]X[n+k]] \qquad \text{(definition)}$$

so that with $k = 0$ we have $r_X[0] = E[X^2[n]] > 0$.

□

Note that $r_X[0]$ is the average power of the random process at all sample times
$n$. One can view $X[n]$ as the voltage across a 1 ohm resistor and hence $x^2[n]/1$
is the power for any particular realization of $X[n]$ at time $n$. The average power
$E[X^2[n]] = r_X[0]$ does not change with time.

Figure 17.4: Autocorrelation sequence for differenced random process.

Property 17.2 - ACS is an even sequence or $r_X[-k] = r_X[k]$.

Proof:

$$r_X[k] = E[X[n]X[n+k]] \qquad \text{(definition)}$$

$$r_X[-k] = E[X[n]X[n-k]]$$

and letting $n = m + k$ since the choice of $n$ in the definition of the ACS is arbitrary,
we have

$$\begin{aligned}
r_X[-k] &= E[X[m+k]X[m]] \\
&= E[X[m]X[m+k]] \\
&= E[X[n]X[n+k]] \qquad \text{(ACS not dependent on } n\text{)} \\
&= r_X[k].
\end{aligned}$$

□

Property 17.3 - Maximum absolute value of ACS is at $k = 0$ or $|r_X[k]| \le r_X[0]$.

Note that it is possible for some values of $r_X[k]$ for $k \neq 0$ to also equal $r_X[0]$. As an
example, for the randomly phased sinusoid of Example 16.11 we had $c_X[n_1,n_2] =
\frac{1}{2}\cos[2\pi(0.1)(n_2 - n_1)]$ with a mean of zero. Thus, $r_X[k] = \frac{1}{2}\cos[2\pi(0.1)k]$ and
therefore $r_X[10] = r_X[0]$. Hence, the property says that no value of the ACS can
exceed $r_X[0]$, although there may be multiple values of the ACS that are equal to
$r_X[0]$.
Proof: The proof is based on the Cauchy-Schwarz inequality, which from Appendix
7A is

$$|E_{V,W}[VW]| \le \sqrt{E_V[V^2]E_W[W^2]}$$

with equality holding if and only if $W = cV$ for $c$ a constant. Letting $V = X[n]$ and
$W = X[n+k]$, we have

$$|E[X[n]X[n+k]]| \le \sqrt{E[X^2[n]]}\sqrt{E[X^2[n+k]]}$$

from which it follows that

$$|r_X[k]| \le \sqrt{r_X[0]}\sqrt{r_X[0]} = |r_X[0]| = r_X[0] \qquad \text{(since } r_X[0] > 0\text{).}$$

Note that equality holds if and only if $X[n+k] = cX[n]$ for all $n$. This implies
perfect predictability of a sample based on the realization of another sample spaced
$k$ units ahead or behind in time (see Problem 17.10 for an example involving periodic
random processes).

□ 


Property 17.4 - ACS measures the predictability of a random process.

The correlation coefficient for two samples of a zero mean WSS random process is

$$\rho_{X[n],X[n+k]} = \frac{r_X[k]}{r_X[0]}. \tag{17.11}$$

For a nonzero mean the expression is easily modified (see Problem 17.11).
Proof: Recall that the correlation coefficient for two random variables $V$ and $W$
is defined as

$$\rho_{V,W} = \frac{\mathrm{cov}(V,W)}{\sqrt{\mathrm{var}(V)\,\mathrm{var}(W)}}.$$

Assuming that $V$ and $W$ are zero mean, this becomes

$$\rho_{V,W} = \frac{E_{V,W}[VW]}{\sqrt{E_V[V^2]E_W[W^2]}}$$

and letting $V = X[n]$ and $W = X[n+k]$, we have

$$\begin{aligned}
\rho_{X[n],X[n+k]} &= \frac{E[X[n]X[n+k]]}{\sqrt{E[X^2[n]]E[X^2[n+k]]}} \\
&= \frac{r_X[k]}{\sqrt{r_X[0]r_X[0]}} \\
&= \frac{r_X[k]}{|r_X[0]|} \\
&= \frac{r_X[k]}{r_X[0]} \qquad \text{(from Property 17.1).}
\end{aligned}$$

As an example, for the differencer of Example 17.1 we have from Figure 17.4

$$\rho_{X[n],X[n+k]} = \begin{cases} 1 & k = 0 \\ -\frac{1}{2} & k = \pm 1 \\ 0 & \text{otherwise.} \end{cases}$$

As mentioned previously, the adjacent samples are negatively correlated and the
magnitude of the correlation coefficient is now seen to be 1/2.

We  next  give  some  more  examples  of  the  computation  of  the  ACS. 

Example 17.2 - White noise

White noise is defined as a WSS random process with zero mean, identical variance
$\sigma^2$, and uncorrelated samples. It is a more general case of the white noise random
process first described in Example 16.9. There we assumed the stronger condition
of zero mean IID samples (hence they must have the same variance due to the
identically distributed assumption and also be uncorrelated due to the independence
assumption). In addition, it was assumed there that each sample had a Gaussian
PDF. Note, however, that the definition given above for white noise does not specify
a particular PDF. To find the ACS we note that from the definition of the white
noise random process

$$\begin{aligned}
r_X[k] &= E[X[n]X[n+k]] \\
&= E[X[n]]E[X[n+k]] = 0 \qquad k \neq 0 \quad \text{(uncorrelated and zero mean samples)} \\
&= E[X^2[n]] = \sigma^2 \qquad k = 0 \quad \text{(equal variance samples).}
\end{aligned}$$

Therefore, we have that

$$r_X[k] = \sigma^2\,\delta[k]. \tag{17.12}$$

Could you predict $X[1]$ from a realization of $X[0]$?


❖ 
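Since the definition of white noise does not specify a PDF, (17.12) should hold for non-Gaussian samples as well. The Python sketch below illustrates this under assumed choices that are not part of the text: samples that are IID uniform on $[-b, b]$ with $b = \sqrt{3\sigma^2}$ (which gives zero mean and variance $\sigma^2$), a fixed seed, and an ACS estimate formed by time averaging.

```python
import math
import random

def uniform_white_noise(num_samples, variance=1.0, seed=1):
    """Zero mean IID samples, uniform on [-b, b] with b = sqrt(3 * variance)."""
    rng = random.Random(seed)
    b = math.sqrt(3.0 * variance)
    return [rng.uniform(-b, b) for _ in range(num_samples)]

def estimate_acs(x, max_lag):
    """Estimate r_X[k], k = 0..max_lag, by averaging x[n] * x[n+k] over n."""
    total = len(x)
    return [sum(x[n] * x[n + k] for n in range(total - k)) / (total - k)
            for k in range(max_lag + 1)]

if __name__ == "__main__":
    x = uniform_white_noise(50000, variance=1.0)
    # Expect roughly [1, 0, 0, 0], i.e. sigma^2 * delta[k].
    print([round(r, 3) for r in estimate_acs(x, 3)])
```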

As an aside, for WSS random processes, we can find the covariance sequence from
the ACS and the mean since

$$\begin{aligned}
c_X[n_1,n_2] &= E[X[n_1]X[n_2]] - E[X[n_1]]E[X[n_2]] \\
&= r_X[n_2-n_1] - \mu^2.
\end{aligned} \tag{17.13}$$

Another property of the ACS that is evident from (17.13) concerns the behavior of
the ACS as $k \to \infty$. Letting $n_1 = n$ and $n_2 = n + k$, we have that

$$r_X[k] = c_X[n,n+k] + \mu^2. \tag{17.14}$$

If two samples become uncorrelated or $c_X[n,n+k] \to 0$ as $k \to \infty$, then we see
that $r_X[k] \to \mu^2$ as $k \to \infty$. Thus, as another property of the ACS we have the
following.

Property 17.5 - ACS approaches $\mu^2$ as $k \to \infty$

This assumes that the samples become uncorrelated for large lags, which is usually
the case.

□

If the mean is zero, then from (17.14)

$$r_X[k] = c_X[n,n+k] \tag{17.15}$$

and the ACS approaches zero as the lag increases. We continue with some more
examples.


Example 17.3 - MA random process

This random process was shown in Example 16.10 to have a zero mean and a
covariance sequence

$$c_X[n_1,n_2] = \begin{cases} \dfrac{\sigma_U^2}{2} & n_1 = n_2 \\[4pt] \dfrac{\sigma_U^2}{4} & |n_2-n_1| = 1 \\[4pt] 0 & \text{otherwise.} \end{cases} \tag{17.16}$$

Since the covariance sequence depends only on $|n_2 - n_1|$, $X[n]$ is WSS from (17.5).
Specifically, the ACS follows from (17.15) and (17.16) with $k = n_2 - n_1$ as

$$r_X[k] = \begin{cases} \dfrac{\sigma_U^2}{2} & k = 0 \\[4pt] \dfrac{\sigma_U^2}{4} & k = \pm 1 \\[4pt] 0 & \text{otherwise.} \end{cases}$$

See Figure 16.13 for a plot of the ACS (replace $\Delta n$ with $k$). Could you predict $X[1]$
from a realization of $X[0]$?

❖


Example 17.4 - Randomly phased sinusoid

This random process was shown in Example 16.11 to have a zero mean and a
covariance sequence $c_X[n_1,n_2] = \frac{1}{2}\cos[2\pi(0.1)(n_2 - n_1)]$. Since the covariance sequence
depends only on $|n_2 - n_1|$, $X[n]$ is WSS. Hence, from (17.15) we have that

$$r_X[k] = \frac{1}{2}\cos[2\pi(0.1)k].$$

See Figure 16.14 for a plot of the ACS (replace $\Delta n$ with $k$). Could you predict $X[1]$
from a realization of $X[0]$?

❖
In determining predictability of a WSS random process, it is convenient to consider
the linear predictor, which depends only on the first two moments. Then, the MMSE
linear prediction of $X[n_0 + k]$ given $x[n_0]$ is from (17.3) and (17.13) with $n_1 = n_0$
and $n_2 = n_0 + k$

$$\hat{X}[n_0+k] = \mu + \frac{r_X[k] - \mu^2}{r_X[0] - \mu^2}\,(x[n_0] - \mu) \qquad \text{for all } k \text{ and } n_0.$$

For a zero mean random process this becomes

$$\begin{aligned}
\hat{X}[n_0+k] &= \frac{r_X[k]}{r_X[0]}\,x[n_0] \\
&= \rho_{X[n_0],X[n_0+k]}\,x[n_0] \qquad \text{for all } k \text{ and } n_0.
\end{aligned}$$
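The predictor above is simple enough to evaluate directly. The following Python sketch is a hypothetical helper (its name and argument order are assumptions) that takes the ACS as a function and applies the general formula. It is then applied to the ACSs of (17.10) and (17.12) with unit variance parameters.

```python
def linear_prediction(x_n0, k, acs, mean=0.0):
    """MMSE linear prediction of X[n0+k] from the observed value x[n0].

    acs is a function returning r_X[k]; the process is assumed WSS.
    """
    cov_k = acs(k) - mean ** 2   # covariance at lag k, from (17.13)
    cov_0 = acs(0) - mean ** 2   # variance of each sample
    return mean + (cov_k / cov_0) * (x_n0 - mean)

def differencer_acs(k, var_u=1.0):
    """ACS of the differencer, equation (17.10)."""
    if k == 0:
        return 2.0 * var_u
    if abs(k) == 1:
        return -var_u
    return 0.0

def white_noise_acs(k, variance=1.0):
    """ACS of white noise, equation (17.12)."""
    return variance if k == 0 else 0.0

if __name__ == "__main__":
    # Differencer: the adjacent sample is predicted with opposite sign.
    print(linear_prediction(2.0, 1, differencer_acs))
    # White noise: the observation carries no information about other samples.
    print(linear_prediction(2.0, 1, white_noise_acs))
```

For white noise the prediction is the mean, whatever value of $x[n_0]$ is observed, which is one way to answer the question posed at the end of Example 17.2.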

One  last  example  is  the  autoregressive  random  process  which  we  will  use  to  illustrate 
several  new  concepts  for  WSS  random  processes. 

Example 17.5 - Autoregressive random process

An autoregressive (AR) random process $X[n]$ is defined to be a WSS random process
with a zero mean that evolves according to the recursive difference equation

$$X[n] = aX[n-1] + U[n] \qquad -\infty < n < \infty \tag{17.17}$$

where $|a| < 1$ and $U[n]$ is WGN. The WGN random process $U[n]$ (see Example
16.6), has a zero mean and variance $\sigma_U^2$ for all $n$ and its samples are all independent
with a Gaussian PDF. The name autoregressive is due to the regression of $X[n]$ upon
$X[n-1]$, which is another sample of the same random process, hence, the prefix
auto. The evolution of $X[n]$ proceeds, for example, as

$$\begin{aligned}
X[0] &= aX[-1] + U[0] \\
X[1] &= aX[0] + U[1] \\
X[2] &= aX[1] + U[2].
\end{aligned}$$

Note that $X[n]$ depends only upon the present and past values of $U[n]$ since for
example

$$\begin{aligned}
X[2] &= aX[1] + U[2] = a(aX[0] + U[1]) + U[2] = a^2X[0] + aU[1] + U[2] \\
&= a^2(aX[-1] + U[0]) + aU[1] + U[2] = a^3X[-1] + a^2U[0] + aU[1] + U[2] \\
&= \cdots \\
&= \sum_{k=0}^{\infty} a^k U[2-k]
\end{aligned} \tag{17.18}$$

where the term involving $a^kU[2-k]$ decays to zero as $k \to \infty$ since $|a| < 1$. We
see that $X[2]$ depends only on $\{U[2], U[1], \ldots\}$ and it is therefore uncorrelated with
$\{U[3], U[4], \ldots\}$. More generally, it can be shown that (see also Problem 19.6)

$$E[X[n]U[n+k]] = 0 \qquad k \ge 1. \tag{17.19}$$

It is seen from (17.18) that in order for the recursion to be stable and hence $X[n]$ to
be WSS it is required that $|a| < 1$. The AR random process can be used to model
a wide variety of physical random processes with various ACSs, depending upon
the choice of the parameters $a$ and $\sigma_U^2$. Some typical realizations of the AR random
process for different values of $a$ are shown in Figure 17.5. The WGN random process
$U[n]$ has been chosen to have a variance $\sigma_U^2 = 1 - a^2$. We will soon see that this
choice of variance results in $r_X[0] = 1$ for both AR processes shown in Figure 17.5.

Figure 17.5: Typical realizations of autoregressive random process with different
parameters. (a) $a = 0.25$, $\sigma_U^2 = 1 - a^2$ (b) $a = 0.98$, $\sigma_U^2 = 1 - a^2$

The MATLAB code used to generate the realizations shown is given below.

    clear all
    randn('state',0)
    a1=0.25;a2=0.98;
    varu1=1-a1^2;varu2=1-a2^2;
    varx1=varu1/(1-a1^2);varx2=varu2/(1-a2^2); % this is r_X[0]
    x1(1,1)=sqrt(varx1)*randn(1,1); % set initial condition X[-1]
                                    % see Problems 17.17, 17.18
    x2(1,1)=sqrt(varx2)*randn(1,1);
    for n=2:31
      x1(n,1)=a1*x1(n-1)+sqrt(varu1)*randn(1,1);
      x2(n,1)=a2*x2(n-1)+sqrt(varu2)*randn(1,1);
    end
We next derive the ACS. In Chapter 18 we will see how to alternatively obtain
the ACS using results from linear systems theory. Using (17.17) we have for $k \ge 1$

$$\begin{aligned}
r_X[k] &= E[X[n]X[n+k]] \\
&= E[X[n](aX[n+k-1] + U[n+k])] \\
&= aE[X[n]X[n+k-1]] \qquad \text{(using (17.19))} \\
&= a\,r_X[k-1].
\end{aligned} \tag{17.20}$$

The solution of this recursive linear difference equation is readily seen to be $r_X[k] =
ca^k$, for $c$ any constant and for $k \ge 1$. For $k = 1$ we have that $r_X[1] = ca$ and so
from (17.20) $r_X[1] = a\,r_X[0]$, which implies $c = r_X[0]$. In Problem 17.15 it is shown
that

$$r_X[0] = \frac{\sigma_U^2}{1-a^2}$$

so that for all $k \ge 0$, $r_X[k] = r_X[0]a^k$ becomes

$$r_X[k] = \frac{\sigma_U^2}{1-a^2}\,a^k.$$

Finally, noting that $r_X[-k] = r_X[k]$ from Property 17.2, we obtain the ACS as

$$r_X[k] = \frac{\sigma_U^2}{1-a^2}\,a^{|k|} \qquad -\infty < k < \infty. \tag{17.21}$$

(See also Problem 17.16 for an alternative derivation of the ACS.) The ACS is
plotted in Figure 17.6 for $a = 0.25$ and $a = 0.98$ and $\sigma_U^2 = 1 - a^2$. For both values of
$a$ the value of $\sigma_U^2$ has been chosen to ensure that $r_X[0] = 1$. Note that for $a = 0.25$
the ACS dies off very rapidly which means that the random process samples quickly
become uncorrelated as the separation between them increases. This is consistent
with the typical realization shown in Figure 17.5a. For $a = 0.98$ the ACS decays
very slowly, indicating a strong positive correlation between samples, and again
being consistent with the typical realization shown in Figure 17.5b. In either case
the samples become uncorrelated as $k \to \infty$ since $|a| < 1$ and therefore, $r_X[k] \to 0$
as $k \to \infty$ in accordance with Property 17.5. However, the random process with the
slower decaying ACS is more predictable.

❖
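The closed form (17.21) can be compared against a simulated realization. The Python sketch below is an illustration only: it generates the recursion (17.17), draws the starting value with variance $\sigma_U^2/(1-a^2)$ so that the realization has the WSS variance from its first sample, and estimates the ACS by time averaging. The function names, seed, and record length are assumptions.

```python
import math
import random

def ar_acs(k, a, var_u):
    """Closed-form ACS of (17.21): var_u / (1 - a^2) * a^|k|."""
    return var_u / (1.0 - a ** 2) * a ** abs(k)

def ar_realization(num_samples, a, var_u, seed=0):
    """Generate X[n] = a X[n-1] + U[n] with U[n] zero mean Gaussian, variance var_u."""
    rng = random.Random(seed)
    # The initial condition is drawn with the WSS variance r_X[0].
    x_prev = rng.gauss(0.0, math.sqrt(var_u / (1.0 - a ** 2)))
    x = []
    for _ in range(num_samples):
        x_prev = a * x_prev + rng.gauss(0.0, math.sqrt(var_u))
        x.append(x_prev)
    return x

def estimate_acs(x, max_lag):
    """Estimate r_X[k], k = 0..max_lag, by averaging x[n] * x[n+k] over n."""
    total = len(x)
    return [sum(x[n] * x[n + k] for n in range(total - k)) / (total - k)
            for k in range(max_lag + 1)]

if __name__ == "__main__":
    a = 0.25
    var_u = 1.0 - a ** 2            # chosen so that r_X[0] = 1
    x = ar_realization(100000, a, var_u)
    for k, r_hat in enumerate(estimate_acs(x, 3)):
        print(k, round(r_hat, 3), round(ar_acs(k, a, var_u), 3))
```

For $a$ close to 1 the samples are strongly correlated, so a much longer record would be needed for the time-averaged estimate to settle near the closed form.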

One last property that is necessary for a sequence to be a valid ACS is the property
of positive definiteness. As its name implies, it is related to the positive definite
property of the covariance matrix. As an example, consider the random vector
$\mathbf{X} = [X[0]\ X[1]]^T$. Then we know from the proof of Property 9.2 (covariance matrix
is positive semidefinite) that if $Y = a_0X[0] + a_1X[1]$ cannot be made equal to a
constant by any choice of $a_0$ and $a_1$, then

$$\mathrm{var}(Y) = \underbrace{\begin{bmatrix} a_0 & a_1 \end{bmatrix}}_{\mathbf{a}^T}
\underbrace{\begin{bmatrix} \mathrm{cov}(X[0],X[0]) & \mathrm{cov}(X[0],X[1]) \\ \mathrm{cov}(X[1],X[0]) & \mathrm{cov}(X[1],X[1]) \end{bmatrix}}_{\mathbf{C}_X}
\underbrace{\begin{bmatrix} a_0 \\ a_1 \end{bmatrix}}_{\mathbf{a}} = \mathbf{a}^T\mathbf{C}_X\mathbf{a} > 0.$$

Figure 17.6: The autocorrelation sequence for autoregressive random processes with
different parameters. (a) $a = 0.25$, $\sigma_U^2 = 1 - a^2$ (b) $a = 0.98$, $\sigma_U^2 = 1 - a^2$

Since this holds for all $\mathbf{a} \neq \mathbf{0}$, the covariance matrix $\mathbf{C}_X$ is by definition positive
definite (see Appendix C). (If it were possible to choose $a_0$ and $a_1$ so that $Y = c$,
for $c$ a constant, then $X[1]$ would be perfectly predictable from $X[0]$ as $X[1] =
-(a_0/a_1)X[0] + (c/a_1)$. Therefore, we could have $\mathrm{var}(Y) = \mathbf{a}^T\mathbf{C}_X\mathbf{a} = 0$, and $\mathbf{C}_X$
would only be positive semidefinite.) Now if $X[n]$ is a zero mean WSS random
process

$$\mathrm{cov}(X[n_1],X[n_2]) = E[X[n_1]X[n_2]] = r_X[n_2-n_1]$$

and the covariance matrix becomes

$$\mathbf{C}_X = \begin{bmatrix} r_X[0] & r_X[1] \\ r_X[-1] & r_X[0] \end{bmatrix}
= \underbrace{\begin{bmatrix} r_X[0] & r_X[1] \\ r_X[1] & r_X[0] \end{bmatrix}}_{\mathbf{R}_X}.$$

Therefore, the covariance matrix, which we now denote by $\mathbf{R}_X$ and which is called
the autocorrelation matrix, must be positive definite. This implies that all the
principal minors (see Appendix C) are positive. For the $2 \times 2$ case this means that

$$\begin{aligned}
r_X[0] &> 0 \\
r_X^2[0] - r_X^2[1] &> 0
\end{aligned} \tag{17.22}$$

with the first condition being consistent with Property 17.1 and the second condition
producing $r_X[0] > |r_X[1]|$. The latter condition is nearly consistent with Property
17.3, the slight difference being that the case $|r_X[1]| = r_X[0]$ is excluded. This is
because we assumed that $X[1]$ was not perfectly predictable from knowledge of $X[0]$.
If we allow perfect predictability, then the autocorrelation matrix is only positive
semidefinite and the $>$ sign in the second equation of (17.22) would be replaced
with $\ge$. In general the $N \times N$ autocorrelation matrix $\mathbf{R}_X$ is given as the covariance
matrix of the zero mean random vector $\mathbf{X} = [X[0]\ X[1] \ldots X[N-1]]^T$ as

$$\mathbf{R}_X = \begin{bmatrix}
r_X[0] & r_X[1] & r_X[2] & \cdots & r_X[N-1] \\
r_X[1] & r_X[0] & r_X[1] & \cdots & r_X[N-2] \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r_X[N-1] & r_X[N-2] & r_X[N-3] & \cdots & r_X[0]
\end{bmatrix}. \tag{17.23}$$


For a sequence to be a valid ACS the $N \times N$ autocorrelation matrix must be positive
semidefinite for all $N = 1, 2, \ldots$ and positive definite if we exclude the possibility of
perfect predictability [Brockwell and Davis 1987]. This imposes a large number of
constraints on $r_X[k]$ and hence not all sequences satisfying Properties 17.1-17.3 are
valid ACSs (see also Problem 17.19). In summary, for our last property of the ACS
we have the following.

Property 17.6 - ACS is a positive semidefinite sequence.

Mathematically, this means that $r_X[k]$ must satisfy

$$\mathbf{a}^T\mathbf{R}_X\mathbf{a} \ge 0$$

for all $\mathbf{a} = [a_0\ a_1 \cdots a_{N-1}]^T$ and where $\mathbf{R}_X$ is the $N \times N$ autocorrelation matrix
given by (17.23). This must hold for all $N \ge 1$.

□ 
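To see that Properties 17.1-17.3 alone do not guarantee a valid ACS, it helps to test a candidate sequence numerically. The Python sketch below builds the matrix of (17.23) and computes its leading principal minors; for a symmetric matrix, all of them being positive is equivalent to positive definiteness (Sylvester's criterion). The candidate values $r_X[0] = 1$, $r_X[1] = 0.9$, $r_X[2] = 0$ are a hypothetical example chosen for illustration, and the recursive determinant is only suitable for small $N$.

```python
def autocorrelation_matrix(acs_values):
    """Build the N x N matrix of (17.23) from [r_X[0], ..., r_X[N-1]]."""
    n = len(acs_values)
    return [[acs_values[abs(i - j)] for j in range(n)] for i in range(n)]

def quadratic_form(matrix, a):
    """Evaluate a^T R a."""
    n = len(a)
    return sum(a[i] * matrix[i][j] * a[j] for i in range(n) for j in range(n))

def determinant(matrix):
    """Determinant by cofactor expansion along the first row (small N only)."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0.0
    for j, entry in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * entry * determinant(minor)
    return total

def leading_principal_minors(matrix):
    """Determinants of the upper-left 1x1, 2x2, ..., NxN submatrices."""
    return [determinant([row[:n] for row in matrix[:n]])
            for n in range(1, len(matrix) + 1)]

if __name__ == "__main__":
    candidate = [1.0, 0.9, 0.0]      # satisfies Properties 17.1-17.3
    ar_values = [1.0, 0.9, 0.81]     # (17.21) with a = 0.9, r_X[0] = 1
    print(leading_principal_minors(autocorrelation_matrix(candidate)))
    print(leading_principal_minors(autocorrelation_matrix(ar_values)))
    # A vector for which the candidate gives a negative "variance":
    print(quadratic_form(autocorrelation_matrix(candidate), [1.0, -1.4, 1.0]))
```

The candidate passes the $2 \times 2$ test of (17.22) but fails at $N = 3$, so it cannot be the ACS of any WSS random process, whereas the AR values pass at every size checked.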


17.5  Ergodicity  and  Temporal  Averages 

When a random process is WSS, its mean does not depend on time. Hence, the
random variables $\ldots, X[-1], X[0], X[1], \ldots$ all have the same mean. Then, at least
as far as the mean is concerned, when we observe a realization of a random process,
it is as if we are observing multiple realizations of the same random variable. This
suggests that we may be able to determine the value of the mean from a single
infinite length realization. To pursue this idea further we plot three realizations of
an IID random process whose marginal PDF is Gaussian with mean $\mu_X[n] = \mu = 1$
and a variance $\sigma_X^2[n] = \sigma^2 = 1$ in Figure 17.7. If we let $x_i[18]$ denote the $i$th
realization at time $n = 18$, then by definition of $E[X[18]]$

$$\lim_{M\to\infty} \frac{1}{M}\sum_{m=1}^{M} x_m[18] = E[X[18]] = \mu_X[18] = \mu = 1. \tag{17.24}$$

This is because as we observe all realizations of the random variable $X[18]$ they will
conform to the Gaussian PDF (recall that $X[n] \sim \mathcal{N}(1,1)$). In fact, the original
definition of expected value was based on the relationship given in (17.24). This
type of averaging is called "averaging down the ensemble" and consequently is just a
restatement of our usual notion of the expected value of a random variable.

Figure 17.7: Several realizations of WSS random process with $\mu_X[n] = \mu = 1$.
Vertical dashed line indicates "ensemble averaging" while horizontal dashed line
indicates "temporal averaging."

However, if we are given only a single realization such as $x_1[n]$, then it seems
reasonable that

$$\hat{\mu}_N = \frac{1}{N}\sum_{n=0}^{N-1} x_1[n]$$

should also converge to $\mu$ as $N \to \infty$. This type of averaging is called "temporal
averaging" since we are averaging the samples in time. If it is true that the temporal
average converges to $\mu$, then we can state that

$$\lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} x_1[n] = \mu = E[X[18]] = \lim_{M\to\infty} \frac{1}{M}\sum_{m=1}^{M} x_m[18]$$
and it is said that temporal averaging is equivalent to ensemble averaging or that
the random process is ergodic in the mean. This property is of great practical
importance since it assures us that by averaging enough samples of the realization,
we can determine the mean of the random process. For the case of an IID random
process ergodicity holds due to the law of large numbers (see Chapter 15). Recall
that if $X_1, X_2, \ldots, X_N$ are IID random variables with mean $\mu$ and variance $\sigma^2$, then
the sample mean random variable has the property that

$$\frac{1}{N}\sum_{i=1}^{N} X_i \to E[X] = \mu \qquad \text{as } N \to \infty.$$

Hence, if $X[n]$ is an IID random process, the conditions required for the law of large
numbers to hold are satisfied, and we can immediately conclude that

$$\hat{\mu}_N = \frac{1}{N}\sum_{n=0}^{N-1} X[n] \to \mu. \tag{17.25}$$
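The equivalence of the two kinds of averaging for an IID process can be illustrated numerically. In the Python sketch below, which uses hypothetical function names and arbitrary sizes and seed, the ensemble average is taken over many short realizations at the fixed time $n = 18$, while the temporal average is taken along a single long realization. Both should be close to $\mu = 1$.

```python
import random

def iid_gaussian_realization(num_samples, mu=1.0, sigma=1.0, rng=None):
    """One realization x[0..num_samples-1] of an IID Gaussian random process."""
    if rng is None:
        rng = random.Random()
    return [rng.gauss(mu, sigma) for _ in range(num_samples)]

def ensemble_average(realizations, n):
    """Average down the ensemble at the fixed time n, as in (17.24)."""
    return sum(x[n] for x in realizations) / len(realizations)

def temporal_average(x):
    """Average along one realization in time, as in (17.25)."""
    return sum(x) / len(x)

if __name__ == "__main__":
    rng = random.Random(3)
    ensemble = [iid_gaussian_realization(19, rng=rng) for _ in range(5000)]
    single = iid_gaussian_realization(5000, rng=rng)
    print("ensemble average at n = 18:", round(ensemble_average(ensemble, 18), 3))
    print("temporal average          :", round(temporal_average(single), 3))
```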

Now the assumptions required for a random process to be IID are overly restrictive
for (17.25) to hold. More generally, if $X[n]$ is a WSS random process, then since
$E[X[n]] = \mu$, it follows that $E[\hat{\mu}_N] = (1/N)\sum_{n=0}^{N-1} E[X[n]] = \mu$. Therefore, the only
further condition required for ergodicity in the mean is that

$$\lim_{N\to\infty} \mathrm{var}(\hat{\mu}_N) = 0.$$

In the case of the IID random process it is easily shown that $\mathrm{var}(\hat{\mu}_N) = \sigma^2/N \to 0$
as $N \to \infty$ and the condition is satisfied. More generally, however, the random
process samples are correlated so that evaluation of this variance is slightly more
complicated. We illustrate this computation next.

Example  17.6  -  General  MA  random  process 

Consider the general MA random process given as $X[n] = (U[n] + U[n-1])/2$,
where $E[U[n]] = \mu$ and $\mathrm{var}(U[n]) = \sigma_U^2$ for $-\infty < n < \infty$ and the $U[n]$'s are
all uncorrelated. This is similar to the MA process of Example 16.10 but is more
general in that the mean of $U[n]$ is not necessarily zero, the samples of $U[n]$ are only
uncorrelated, and hence, not necessarily independent, and the PDF of each sample
need not be Gaussian. The general MA process $X[n]$ is easily shown to be WSS
and to have a mean sequence $\mu_X[n] = \mu$ (see Problem 17.20). To determine if it is
ergodic in the mean we must compute $\mathrm{var}(\hat{\mu}_N)$ and show that it converges to
zero as $N \to \infty$. Now

$$\mathrm{var}(\hat{\mu}_N) = \mathrm{var}\left(\frac{1}{N}\sum_{n=0}^{N-1} X[n]\right).$$

Since the $X[n]$'s are now correlated, we use (9.26), where $\mathbf{a} = [a_0\ a_1 \ldots a_{N-1}]^T$ with
$a_n = 1/N$, to yield

$$\mathrm{var}(\hat{\mu}_N) = \mathrm{var}\left(\sum_{n=0}^{N-1} a_n X[n]\right) = \mathbf{a}^T\mathbf{C}_X\mathbf{a}. \tag{17.26}$$

The covariance matrix has $(i,j)$ element

$$[\mathbf{C}_X]_{ij} = E[(X[i]-E[X[i]])(X[j]-E[X[j]])] \qquad i = 0,1,\ldots,N-1;\ j = 0,1,\ldots,N-1.$$

But

$$\begin{aligned}
X[n] - E[X[n]] &= \tfrac{1}{2}(U[n] + U[n-1]) - \tfrac{1}{2}(\mu + \mu) \\
&= \tfrac{1}{2}[(U[n]-\mu) + (U[n-1]-\mu)] \\
&= \tfrac{1}{2}[\bar{U}[n] + \bar{U}[n-1]]
\end{aligned}$$

where $\bar{U}[n] = U[n] - \mu$ is a zero mean random variable for each value of $n$. Thus,

$$\begin{aligned}
[\mathbf{C}_X]_{ij} &= \tfrac{1}{4}E[(\bar{U}[i] + \bar{U}[i-1])(\bar{U}[j] + \bar{U}[j-1])] \\
&= \tfrac{1}{4}\left(E[\bar{U}[i]\bar{U}[j]] + E[\bar{U}[i]\bar{U}[j-1]] + E[\bar{U}[i-1]\bar{U}[j]] + E[\bar{U}[i-1]\bar{U}[j-1]]\right)
\end{aligned}$$

and since $E[\bar{U}[n_1]\bar{U}[n_2]] = \mathrm{cov}(U[n_1],U[n_2]) = \sigma_U^2\,\delta[n_2-n_1]$ (all the $U[n]$'s are
uncorrelated), we have

$$[\mathbf{C}_X]_{ij} = \tfrac{1}{4}\left(\sigma_U^2\,\delta[j-i] + \sigma_U^2\,\delta[j-1-i] + \sigma_U^2\,\delta[j-i+1] + \sigma_U^2\,\delta[j-i]\right).$$

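Finally, we have the required covariance matrix

$$[\mathbf{C}_X]_{ij} = \begin{cases} \dfrac{\sigma_U^2}{2} & i = j \\[4pt] \dfrac{\sigma_U^2}{4} & |i-j| = 1 \\[4pt] 0 & \text{otherwise.} \end{cases} \tag{17.27}$$

Using this in (17.26) produces

$$\begin{aligned}
\mathrm{var}(\hat{\mu}_N) &= \mathbf{a}^T\mathbf{C}_X\mathbf{a} \\
&= \begin{bmatrix} \frac{1}{N} & \frac{1}{N} & \cdots & \frac{1}{N} \end{bmatrix}
\begin{bmatrix}
\frac{\sigma_U^2}{2} & \frac{\sigma_U^2}{4} & 0 & \cdots & 0 & 0 \\
\frac{\sigma_U^2}{4} & \frac{\sigma_U^2}{2} & \frac{\sigma_U^2}{4} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \frac{\sigma_U^2}{2} & \frac{\sigma_U^2}{4} \\
0 & 0 & 0 & \cdots & \frac{\sigma_U^2}{4} & \frac{\sigma_U^2}{2}
\end{bmatrix}
\begin{bmatrix} \frac{1}{N} \\ \frac{1}{N} \\ \vdots \\ \frac{1}{N} \end{bmatrix} \\
&= \frac{1}{N^2}\sum_{i=0}^{N-1}\frac{\sigma_U^2}{2} + \frac{1}{N^2}\sum_{i=0}^{N-2}\frac{\sigma_U^2}{4} + \frac{1}{N^2}\sum_{i=1}^{N-1}\frac{\sigma_U^2}{4} \\
&= \frac{\sigma_U^2}{2N} + \frac{\sigma_U^2(N-1)}{4N^2} + \frac{\sigma_U^2(N-1)}{4N^2} \to 0 \qquad \text{as } N \to \infty.
\end{aligned}$$

Finally, we see that the general MA random process is ergodic in the mean.

❖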
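The quadratic form of (17.26) with the banded covariance matrix of (17.27) can be evaluated by brute force and compared with what the three sums reduce to, namely $\sigma_U^2/(2N) + 2\sigma_U^2(N-1)/(4N^2)$. The Python sketch below does this for a few values of $N$. The function names are hypothetical and the double loop is meant only as a direct transcription of $\mathbf{a}^T\mathbf{C}_X\mathbf{a}$, not as an efficient method.

```python
def ma_covariance(i, j, var_u):
    """Element [C_X]_ij of (17.27) for the general MA random process."""
    lag = abs(i - j)
    if lag == 0:
        return var_u / 2.0
    if lag == 1:
        return var_u / 4.0
    return 0.0

def sample_mean_variance(num_samples, var_u):
    """Evaluate a^T C_X a of (17.26) with every a_n = 1/N."""
    weight = 1.0 / num_samples
    return sum(weight * ma_covariance(i, j, var_u) * weight
               for i in range(num_samples) for j in range(num_samples))

def closed_form_variance(num_samples, var_u):
    """Main diagonal contribution plus the two adjacent diagonals."""
    n = num_samples
    return var_u / (2.0 * n) + 2.0 * var_u * (n - 1) / (4.0 * n ** 2)

if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, sample_mean_variance(n, 1.0), closed_form_variance(n, 1.0))
```

For large $N$ the variance behaves like $\sigma_U^2/N$, so it decays at the same rate as for uncorrelated samples of variance $\sigma_U^2$.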

In general, it can be shown that for a WSS random process to be ergodic in the
mean, the variance of the sample mean

$$\mathrm{var}(\hat{\mu}_N) = \frac{1}{N}\sum_{k=-(N-1)}^{N-1}\left(1 - \frac{|k|}{N}\right)\left(r_X[k] - \mu^2\right) \tag{17.28}$$

must converge to zero as $N \to \infty$ (see Problem 17.23 for the derivation of (17.28)).
For this to occur, the covariance sequence $r_X[k] - \mu^2$ must decay to zero at a fast
enough rate as $k \to \infty$, which is to say that as the samples are spaced further
and further apart, they must eventually become uncorrelated. A little reflection on
the part of the reader will reveal that ergodicity requires a single realization of the
random process to display the behavior of the entire ensemble of realizations. If not,
ergodicity will not hold. Consider the following simple nonergodic random process.

Example 17.7 - Random DC level

Define a random process as $X[n] = A$ for $-\infty < n < \infty$, where $A \sim \mathcal{N}(0,1)$. Some
realizations are shown in Figure 17.8. This random process is WSS since

$$\mu_X[n] = E[X[n]] = E[A] = 0 = \mu \qquad -\infty < n < \infty \quad \text{(not dependent on } n\text{)}$$

$$r_X[k] = E[X[n]X[n+k]] = E[A^2] = 1 \qquad \text{(not dependent on } n\text{).}$$

However, it should be clear that $\hat{\mu}_N$ will not converge to $\mu = 0$. Referring to the
realization $x_1[n]$ in Figure 17.8, the sample mean will produce $-0.43$ no matter how
large $N$ becomes. In addition, it can be shown that $\mathrm{var}(\hat{\mu}_N) = 1$ (see Problem
17.24). Each realization is not representative of the ensemble of realizations.
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Figure  17.8:  Several  realizations  of  the  random  DC  level  process. 
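
The contrast between an ergodic and a nonergodic process can be checked numerically by evaluating (17.28) directly. The following is a minimal sketch, not code from the text: it assumes a zero mean MA process with $r_X[0] = \sigma_U^2/2$ and $r_X[\pm 1] = \sigma_U^2/4$ (the values appearing in the preceding derivation), takes $\sigma_U^2 = 1$ for illustration, and uses hypothetical names (`sampleMeanVar`, `rMA`, `rDC`).

```matlab
% Evaluate (17.28) with mu = 0:
%   var = (1/N) * sum_{k=-(N-1)}^{N-1} (1 - |k|/N) * r_X[k]
sampleMeanVar = @(r, N) (1/N)*sum((1 - abs(-(N-1):(N-1))/N) .* r(-(N-1):(N-1)));

sigma2U = 1;                                   % assumed white noise variance
% assumed MA-type ACS: r_X[0] = sigma2U/2, r_X[+-1] = sigma2U/4, zero otherwise
rMA = @(k) sigma2U*(0.5*(k == 0) + 0.25*(abs(k) == 1));
% random DC level of Example 17.7: r_X[k] = E[A^2] = 1 for every lag
rDC = @(k) ones(size(k));

Nvals = [10 100 1000];
varMA = arrayfun(@(N) sampleMeanVar(rMA, N), Nvals);   % shrinks as N grows
varDC = arrayfun(@(N) sampleMeanVar(rDC, N), Nvals);   % does not shrink
disp([Nvals' varMA' varDC'])
```

The MA column decreases roughly as $1/N$, while the random DC level column does not decrease at all, which is the numerical counterpart of the statement that a single realization of the DC level process never averages out.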


17.6  The  Power  Spectral  Density 

The  ACS  measures  the  correlation  between  samples  of  a  WSS  random  process.  For 
example,  the  AR  random  process  was  shown  to  have  the  ACS 

\[ r_X[k] = \frac{\sigma_U^2}{1 - a^2}\, a^{|k|} \]

which  for  a  =  0.25  and  a  =  0.98  is  shown  in  Figure  17.6,  along  with  some  typical 
realizations  in  Figure  17.5.  Note  that  when  the  ACS  dies  out  rapidly  (see  Figure 
17.6a),  the  realization  is  more  rapidly  varying  in  time  (see  Figure  17.5a).  In  contrast, 
when  the  ACS  decays  slowly  (see  Figure  17.6b),  the  realization  varies  slowly  (see 
Figure  17.5b).  It  would  seem  that  the  ACS  is  related  to  the  rate  of  change  of  the 
random  process.  For  deterministic  signals  the  rate  of  change  is  usually  measured 
by  examining  a  discrete-time  Fourier  transform  [Jackson  1991].  Signals  with  high 
frequency content exhibit rapid fluctuations in time while signals with only low
frequency content exhibit slow variations in time. For WSS random processes we
will be interested in the power at the various frequencies. In particular, we will
introduce the measure known as the power spectral density (PSD) and show that it
quantifies the distribution of power with frequency. Before doing so, however, we
consider the following deterministically motivated measure of power with frequency
based on the discrete-time Fourier transform

\[ \hat{P}_X(f) = \frac{1}{N}\left|\sum_{n=0}^{N-1} X[n]\exp(-j2\pi f n)\right|^2. \tag{17.29} \]


This is a normalized version of the magnitude-squared discrete-time Fourier transform
of the random process over the time interval $0 \le n \le N-1$. It is called the
periodogram since its original purpose was to find periodicities in random data sets
[Schuster 1898]. In (17.29) $f$ denotes the discrete-time frequency, which is assumed
to be in the range $-1/2 \le f \le 1/2$ for reasons that will be elucidated later. The $1/N$
factor is required to normalize $\hat{P}_X(f)$ to be interpretable as a power spectral density
or power per unit frequency. The use of a “hat” is meant to convey the notion that
this quantity is an estimator. As we now show, the periodogram is not a suitable
measure of the distribution of power with frequency, although it would be for some
deterministic signals (such as periodic discrete-time signals with period $N$). As an
example, we plot $\hat{P}_X(f)$ in Figure 17.9 for the realizations given in Figure 17.5. We


Figure 17.9: Periodogram for autoregressive random process with different parameters:
(a) $a = 0.25$, $\sigma_U^2 = 1 - a^2$; (b) $a = 0.98$, $\sigma_U^2 = 1 - a^2$. The horizontal axis is the
frequency $f$. The realizations shown in Figure 17.5 were used to generate these estimates.

see  that  the  periodogram  in  Figure  17.9a  exhibits  many  random  fluctuations.  Other 
realizations  will  also  produce  similar  seemingly  random  curves.  However,  it  does 
seem  to  produce  a  reasonable  result — for  the  periodogram  in  Figure  17.9a  there  is 
more  high  frequency  power  than  for  the  periodogram  in  Figure  17.9b.  The  reason 
for  the  random  nature  of  the  plot  is  that  (17.29)  is  a  function  of  N  random  variables 
and hence is a random variable itself for each frequency. As such, it exhibits the
variability of a random process for which the usual dependence on time is replaced
by frequency. What we would actually like is an average measure of the power
distribution with frequency, suggesting the need for an expected value. Also, to ensure
that we capture the entire random process behavior, an infinite length realization is
required. We are therefore led to the following more suitable definition of the PSD

\[ P_X(f) = \lim_{M\to\infty}\frac{1}{2M+1}\,E\left[\left|\sum_{n=-M}^{M} X[n]\exp(-j2\pi f n)\right|^2\right]. \tag{17.30} \]

The function $P_X(f)$ is called the power spectral density (PSD) and when integrated
provides a measure of the average power within a band of frequencies. It is
completely analogous to the PDF in that to find the average power of the random process
in the frequency band $f_1 \le f \le f_2$ we should find the area under the PSD curve.


Fourier  analysis  of  a  random  process  yields  no  phase  information. 


In  our  definition  of  the  PSD  we  are  using  the  magnitude-squared  of  the  Fourier 
transform.  It  is  obvious  then,  that  the  PSD  does  not  tell  us  anything  about  the 
phases  of  the  Fourier  transform  of  the  random  process.  This  is  in  contrast  to  a 
Fourier  transform  of  a  deterministic  signal.  There  the  inverse  Fourier  transform  can 
be  viewed  as  a  decomposition  of  the  signal  into  sinusoids  of  different  frequencies 
with deterministic amplitudes and phases. For a random process a similar
decomposition called the spectral representation theorem [Brockwell and Davis 1987] yields
sinusoids of different frequencies with random amplitudes and random phases. The
PSD is essentially the expected value of the power of the random sinusoidal
amplitudes per unit of frequency. No phase information is retained and therefore no phase
information can be extracted from knowledge of the PSD.


We  next  give  an  example  of  the  computation  of  a  PSD. 


Example  17.8  -  White  noise 

Assume that $X[n]$ is white noise (see Example 17.2) and therefore has a zero mean
and ACS $r_X[k] = \sigma^2\delta[k]$. Then,

\begin{align}
P_X(f) &= \lim_{M\to\infty}\frac{1}{2M+1}\,E\left[\sum_{n=-M}^{M} X[n]\exp(j2\pi f n)\sum_{m=-M}^{M} X[m]\exp(-j2\pi f m)\right] \notag \\
&= \lim_{M\to\infty}\frac{1}{2M+1}\sum_{n=-M}^{M}\sum_{m=-M}^{M}\underbrace{E[X[n]X[m]]}_{r_X[m-n]}\exp[-j2\pi f(m-n)] \tag{17.31} \\
&= \lim_{M\to\infty}\frac{1}{2M+1}\sum_{n=-M}^{M}\sum_{m=-M}^{M}\sigma^2\delta[m-n]\exp[-j2\pi f(m-n)] \notag \\
&= \lim_{M\to\infty}\frac{1}{2M+1}\sum_{n=-M}^{M}\sigma^2 \notag \\
&= \lim_{M\to\infty}\sigma^2 = \sigma^2. \notag
\end{align}

Hence, for white noise the PSD is

\[ P_X(f) = \sigma^2, \quad -1/2 \le f \le 1/2. \tag{17.32} \]

As  first  mentioned  in  Chapter  16  white  noise  contains  equal  contributions  of  average 
power  at  all  frequencies. 

0 
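
The expectation in (17.30) can be mimicked by averaging over many independent realizations. The sketch below is an illustration under stated assumptions rather than anything from the text: Gaussian white noise with $\sigma^2 = 1$, a finite $M = 50$ in place of the limit, and 2000 realizations in place of $E[\cdot]$; the variable names are hypothetical.

```matlab
M = 50;                    % each realization spans n = -M,...,M (2M+1 samples)
nReal = 2000;              % number of realizations used to mimic E[.]
sigma2 = 1;                % assumed white noise variance
f = [-0.3 0 0.1 0.45];     % a few test frequencies in [-1/2, 1/2]

n = (-M:M)';
E = exp(-1j*2*pi*n*f);                    % (2M+1) x length(f) complex exponentials
X = sqrt(sigma2)*randn(nReal, 2*M+1);     % each row is one white noise realization
% |sum_n X[n] exp(-j 2 pi f n)|^2 / (2M+1), averaged down the ensemble
P = mean(abs(X*E).^2, 1)/(2*M+1);
disp(P)                                   % every entry should be close to sigma2
```

Because white noise has a flat PSD, the averaged values fluctuate around $\sigma^2$ at every frequency; a single realization (`nReal = 1`) shows the much larger scatter typical of the raw periodogram.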

A more straightforward approach to obtaining the PSD is based on knowledge of
the ACS. From (17.31) we see that

\[ P_X(f) = \lim_{M\to\infty}\frac{1}{2M+1}\sum_{n=-M}^{M}\sum_{m=-M}^{M} r_X[m-n]\exp[-j2\pi f(m-n)]. \tag{17.33} \]

This can be simplified using the formula (see Problem 17.26)

\[ \sum_{n=-M}^{M}\sum_{m=-M}^{M} g[m-n] = \sum_{k=-2M}^{2M}(2M+1-|k|)\,g[k] \]

which results from considering $g[m-n]$ as an element of the $(2M+1)\times(2M+1)$
matrix $\mathbf{G}$ with elements $[\mathbf{G}]_{mn} = g[m-n]$ for $m = -M,\ldots,M$ and $n = -M,\ldots,M$
and then summing all the elements. Using this relationship in (17.33) produces

\begin{align*}
P_X(f) &= \lim_{M\to\infty}\frac{1}{2M+1}\sum_{k=-2M}^{2M}(2M+1-|k|)\,r_X[k]\exp(-j2\pi f k) \\
&= \lim_{M\to\infty}\sum_{k=-2M}^{2M}\left(1-\frac{|k|}{2M+1}\right)r_X[k]\exp(-j2\pi f k).
\end{align*}

Assuming that $\sum_{k=-\infty}^{\infty}|r_X[k]| < \infty$, the limit can be shown to produce the final
result (see Problem 17.27)

\[ P_X(f) = \sum_{k=-\infty}^{\infty} r_X[k]\exp(-j2\pi f k) \tag{17.34} \]

which says that the power spectral density is the discrete-time Fourier transform
of the ACS. This relationship is known as the Wiener-Khinchine theorem. Some
examples follow.
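
The double-sum formula used in the derivation of (17.34) is easy to spot-check numerically. The following sketch sums all the elements of the matrix with entries $g[m-n]$ and compares the result with the weighted single sum over the lag $k$; the sequence `g` and the value of `M` are arbitrary illustrative choices, not taken from the text.

```matlab
M = 6;
g = @(k) 0.8.^abs(k) .* cos(0.3*k);       % arbitrary illustrative sequence g[k]

[mm, nn] = ndgrid(-M:M, -M:M);            % all (m, n) index pairs
lhs = sum(sum(g(mm - nn)));               % sum of every element of G, [G]_mn = g[m-n]

k = -2*M:2*M;
rhs = sum((2*M + 1 - abs(k)) .* g(k));    % lag k occurs 2M+1-|k| times in G

fprintf('double sum = %.10f, single sum = %.10f\n', lhs, rhs);
```

The weight $2M+1-|k|$ is simply the number of entries on the $k$th diagonal of the matrix, which is why the two sums agree for any sequence.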

Example  17.9  —  White  noise 

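From Example 17.2 $r_X[k] = \sigma^2\delta[k]$ and so

\begin{align*}
P_X(f) &= \sum_{k=-\infty}^{\infty} r_X[k]\exp(-j2\pi f k) \\
&= \sum_{k=-\infty}^{\infty} \sigma^2\delta[k]\exp(-j2\pi f k) \\
&= \sigma^2.
\end{align*}

This is shown in Figure 17.10. Note that the total average power in $X[n]$, which is
$r_X[0] = \sigma^2$, is given by the area under the PSD curve.

Figure 17.10: PSD of white noise ($P_X(f)$ is constant at $\sigma^2$ for $-1/2 \le f \le 1/2$).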


Example  17.10  —  AR  random  process 

From (17.21) we have that

\[ r_X[k] = \frac{\sigma_U^2}{1-a^2}\,a^{|k|}, \quad -\infty < k < \infty \]

and from (17.34)

\begin{align*}
P_X(f) &= \sum_{k=-\infty}^{\infty} r_X[k]\exp(-j2\pi f k) \\
&= \frac{\sigma_U^2}{1-a^2}\sum_{k=-\infty}^{\infty} a^{|k|}\exp(-j2\pi f k) \\
&= \frac{\sigma_U^2}{1-a^2}\left[\sum_{k=-\infty}^{-1} a^{-k}\exp(-j2\pi f k) + \sum_{k=0}^{\infty} a^{k}\exp(-j2\pi f k)\right] \\
&= \frac{\sigma_U^2}{1-a^2}\left[\sum_{k=1}^{\infty}\left[a\exp(j2\pi f)\right]^k + \sum_{k=0}^{\infty}\left[a\exp(-j2\pi f)\right]^k\right].
\end{align*}

Since $|a\exp(\pm j2\pi f)| = |a| < 1$, we can use the formula $\sum_{k=k_0}^{\infty} z^k = z^{k_0}/(1-z)$ for
$z$ a complex number with $|z| < 1$ to evaluate the sums. This produces

\begin{align}
P_X(f) &= \frac{\sigma_U^2}{1-a^2}\left(\frac{a\exp(j2\pi f)}{1-a\exp(j2\pi f)} + \frac{1}{1-a\exp(-j2\pi f)}\right) \notag \\
&= \frac{\sigma_U^2}{1-a^2}\,\frac{a\exp(j2\pi f)(1-a\exp(-j2\pi f)) + (1-a\exp(j2\pi f))}{(1-a\exp(j2\pi f))(1-a\exp(-j2\pi f))} \notag \\
&= \frac{\sigma_U^2}{1-a^2}\,\frac{1-a^2}{|1-a\exp(-j2\pi f)|^2} \notag \\
&= \frac{\sigma_U^2}{|1-a\exp(-j2\pi f)|^2}. \tag{17.35}
\end{align}

This can also be written in real form as

\[ P_X(f) = \frac{\sigma_U^2}{1+a^2-2a\cos(2\pi f)}, \quad -1/2 \le f \le 1/2. \tag{17.36} \]

For $a = 0.25$ and $a = 0.98$ and $\sigma_U^2 = 1 - a^2$, the PSDs are plotted in Figure
17.11. Note that the total average power in each PSD is the same, being $r_X[0] =
\sigma_U^2/(1-a^2) = 1$. As expected the more noise-like random process has a PSD (see
Figure 17.11a) with more high frequency average power than the slowly varying
random process (see Figure 17.11b) which has all its average power near $f = 0$ (or
at DC).

0 
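
The closed form (17.36) can be cross-checked against a direct, truncated evaluation of the sum in (17.34). The sketch below does this for $a = 0.25$, where the ACS decays so quickly that truncating at $|k| \le 200$ is harmless (for $a = 0.98$ many more lags would be needed); the grid size, truncation length, and variable names are illustrative assumptions.

```matlab
a = 0.25;  sigma2U = 1 - a^2;             % parameters of Figure 17.11a
f = linspace(-0.5, 0.5, 1001);            % frequency grid (row vector)
k = (-200:200)';                          % lags kept in the truncated sum
r = (sigma2U/(1 - a^2)) * a.^abs(k);      % ACS from (17.21)

% truncated version of (17.34): sum_k r_X[k] exp(-j 2 pi f k)
Psum = real(sum(bsxfun(@times, r, exp(-1j*2*pi*k*f)), 1));
% closed form (17.36)
Pclosed = sigma2U ./ (1 + a^2 - 2*a*cos(2*pi*f));

maxErr = max(abs(Psum - Pclosed));        % agreement between the two forms
totalPower = trapz(f, Pclosed);           % area under the PSD, should match r_X[0]
fprintf('max error = %g, total power = %g\n', maxErr, totalPower);
```

The two curves agree to rounding error, and the area under the PSD reproduces the total average power $r_X[0] = \sigma_U^2/(1-a^2) = 1$.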

From the previous example, we observe that the PSD exhibits the properties of
being a real nonnegative function of frequency, consistent with our notion of power
as a nonnegative physical quantity, of being symmetric about $f = 0$, and of being
periodic with period one (see (17.36)). We next prove that these properties are true
in general.

Figure 17.11: Power spectral densities for autoregressive random process with different
parameters: (a) $a = 0.25$, $\sigma_U^2 = 1 - a^2$; (b) $a = 0.98$, $\sigma_U^2 = 1 - a^2$. The periodograms,
which are estimated PSDs, were given in Figure 17.9.


Property 17.7 - PSD is a real function.

The PSD is also given by the real function

\[ P_X(f) = \sum_{k=-\infty}^{\infty} r_X[k]\cos(2\pi f k). \tag{17.37} \]

Proof:

\begin{align*}
P_X(f) &= \sum_{k=-\infty}^{\infty} r_X[k]\exp(-j2\pi f k) \\
&= \sum_{k=-\infty}^{\infty} r_X[k]\left(\cos(2\pi f k) - j\sin(2\pi f k)\right) \\
&= \sum_{k=-\infty}^{\infty} r_X[k]\cos(2\pi f k) - j\sum_{k=-\infty}^{\infty} r_X[k]\sin(2\pi f k).
\end{align*}

But

\[ \sum_{k=-\infty}^{\infty} r_X[k]\sin(2\pi f k) = \sum_{k=-\infty}^{-1} r_X[k]\sin(2\pi f k) + \sum_{k=1}^{\infty} r_X[k]\sin(2\pi f k) \]

since the $k = 0$ term is zero, and letting $l = -k$ in the first sum we have

\begin{align*}
\sum_{k=-\infty}^{\infty} r_X[k]\sin(2\pi f k) &= \sum_{l=1}^{\infty} r_X[-l]\sin(2\pi f(-l)) + \sum_{k=1}^{\infty} r_X[k]\sin(2\pi f k) \\
&= \sum_{k=1}^{\infty} r_X[k]\left(-\sin(2\pi f k) + \sin(2\pi f k)\right) = 0 \quad (r_X[-l] = r_X[l])
\end{align*}

from which (17.37) follows.


□ 


Property 17.8 - PSD is nonnegative.

\[ P_X(f) \ge 0 \]

Proof: Follows from (17.30) but can also be shown to follow from the positive
semidefinite property of the ACS [Brockwell and Davis 1987]. (See also Problem
17.19.)

□


Property 17.9 - PSD is symmetric about $f = 0$.

\[ P_X(-f) = P_X(f) \]

Proof: Follows from (17.37).

□


Property 17.10 - PSD is periodic with period one.

\[ P_X(f+1) = P_X(f) \]

Proof: From (17.37) we have

\begin{align*}
P_X(f+1) &= \sum_{k=-\infty}^{\infty} r_X[k]\cos(2\pi(f+1)k) \\
&= \sum_{k=-\infty}^{\infty} r_X[k]\cos(2\pi f k + 2\pi k) \\
&= \sum_{k=-\infty}^{\infty} r_X[k]\cos(2\pi f k) \quad (\cos(2\pi k) = 1,\ \sin(2\pi k) = 0) \\
&= P_X(f).
\end{align*}

□


Property 17.11 - ACS recovered from PSD using inverse Fourier transform

\begin{align}
r_X[k] &= \int_{-1/2}^{1/2} P_X(f)\exp(j2\pi f k)\,df, \quad -\infty < k < \infty \tag{17.38} \\
&= \int_{-1/2}^{1/2} P_X(f)\cos(2\pi f k)\,df, \quad -\infty < k < \infty \tag{17.39}
\end{align}

Proof: (17.38) follows from properties of the discrete-time Fourier transform [Jackson
1991]. (17.39) follows from Property 17.9 (see Appendix B.5 and also Problem
17.49).

□ 


Property 17.12 - PSD yields average power over band of frequencies.

To obtain the average power in the frequency band $f_1 \le f \le f_2$ we need only find
the area under the PSD curve for this band. The average physical power is obtained
as twice this area since the negative frequencies account for half of the average power
(recall Property 17.9). Hence,

\[ \text{Average physical power in } [f_1, f_2] = 2\int_{f_1}^{f_2} P_X(f)\,df. \tag{17.40} \]

The proof of this property requires some concepts to be described in the next chapter,
and thus, we defer the proof until Section 18.4. Note, however, that if $f_1 = 0$ and
$f_2 = 1/2$, then the average power in this band is

\begin{align*}
\text{Average physical power in } [0, 1/2] &= 2\int_{0}^{1/2} P_X(f)\,df \\
&= \int_{-1/2}^{1/2} P_X(f)\,df \quad \text{(due to symmetry of PSD)} \\
&= \int_{-1/2}^{1/2} P_X(f)\exp(j2\pi f(0))\,df \\
&= r_X[0] \quad \text{(from (17.38))}
\end{align*}

which we have already seen yields the total average power since $r_X[0] = E[X^2[n]]$.
Hence, we see that the total average power is obtained by integrating the PSD over
all frequencies to yield

\[ r_X[0] = \int_{-1/2}^{1/2} P_X(f)\,df. \tag{17.41} \]

Definitions of PSD are not consistent.

In some texts, especially ones describing the use of the PSD for physical measurements,
the definition of the PSD is slightly different. The alternative definition relies
on the relationship of (17.40) to define the PSD as $G_X(f) = 2P_X(f)$. It is called the
one-sided PSD and its advantage is that it yields directly the average power over a
band when integrated over the band. As can be seen from (17.40)

\[ \text{Average physical power in } [f_1, f_2] = \int_{f_1}^{f_2} G_X(f)\,df. \]

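
As a numerical illustration of (17.40), the band power of the slowly varying AR process can be computed by integrating its PSD. This is a sketch under stated assumptions: the PSD is taken from (17.36) with $a = 0.98$ and $\sigma_U^2 = 1 - a^2$, the band edge $0.05$ is an arbitrary choice, and `bandPower` is a hypothetical helper that applies the trapezoidal rule on a fine grid.

```matlab
a = 0.98;  sigma2U = 1 - a^2;                       % process of Figure 17.11b
Px = @(f) sigma2U ./ (1 + a^2 - 2*a*cos(2*pi*f));   % PSD from (17.36)

% average physical power in [f1, f2] per (17.40): twice the area under the PSD
bandPower = @(f1, f2) 2*trapz(linspace(f1, f2, 20001), Px(linspace(f1, f2, 20001)));

pLow   = bandPower(0, 0.05);      % power in the low band [0, 0.05]
pHigh  = bandPower(0.05, 0.5);    % power in the remaining band [0.05, 0.5]
pTotal = bandPower(0, 0.5);       % should reproduce r_X[0] = 1, see (17.41)
fprintf('low band: %.4f, high band: %.4f, total: %.4f\n', pLow, pHigh, pTotal);
```

Integrating the one-sided PSD $G_X(f) = 2P_X(f)$ over the same bands gives identical numbers, since the factor of 2 has merely been moved inside the integral.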

A final comment concerns the periodicity of the PSD. We have chosen the
frequency interval $[-1/2, 1/2]$ over which to display the PSD. The rationale for this
choice arises from the practical situation in which a continuous-time WSS random
process (see Section 17.8) is sampled to produce a discrete-time WSS random
process. Then, if the continuous-time random process $X(t)$ has a PSD that is
bandlimited to $W$ Hz and is sampled at $F_s$ samples/sec, the discrete-time PSD $P_X(f)$ will
occupy discrete-time frequencies up to $f = W/F_s$. For Nyquist rate sampling of $F_s = 2W$,
the maximum discrete-time frequency will be $f = W/F_s = 1/2$. Hence, our choice
of the frequency interval $[-1/2, 1/2]$ corresponds to the continuous-time frequency
interval of $[-W, W]$ Hz. The discrete-time frequency is also referred to as the
normalized frequency, the normalizing factor being $F_s$.

17.7  Estimation  of  the  ACS  and  PSD 

Recall from our discussion of ergodicity that in the problem of mean estimation
for a WSS random process, we were restricted to observing only a finite number of
samples of one realization of the random process. If the random process is ergodic
in the mean, then we saw that as the number of samples increases to infinity, the
temporal average $\hat{\mu}_N$ will converge to the ensemble average $\mu$. To apply this result
to estimation of the ACS consider the problem of estimating the ACS for lag $k = k_0$
which is

\[ r_X[k_0] = E[X[n]X[n+k_0]]. \]

Then by defining the product random process $Y[n] = X[n]X[n+k_0]$ we see that

\[ r_X[k_0] = E[Y[n]], \quad -\infty < n < \infty \]

or the desired quantity to be estimated is just the mean of the random process $Y[n]$.
The mean of $Y[n]$ does not depend on $n$. This suggests that we replace the observed
values of $X[n]$ with those of $Y[n]$ by using $y[n] = x[n]x[n+k_0]$, and then use a
temporal average to estimate the ensemble average. Hence, we have the temporal
average estimate

\begin{align}
\hat{r}_X[k_0] &= \frac{1}{N}\sum_{n=0}^{N-1} y[n] \notag \\
&= \frac{1}{N}\sum_{n=0}^{N-1} x[n]x[n+k_0]. \tag{17.42}
\end{align}

Also, since $r_X[-k] = r_X[k]$, we need only estimate the ACS for $k \ge 0$. There
is one slight modification that we need to make to the estimate. Assuming that
$\{x[0], x[1], \ldots, x[N-1]\}$ are observed, we must choose the upper limit on the
summation in (17.42) to satisfy the constraint $n + k_0 \le N - 1$. This is because $x[n+k_0]$
is unobserved for $n + k_0 > N - 1$. With this modification we have as our estimate
of the ACS (and now replacing the specific lag of $k_0$ by the more general lag $k$)

\[ \hat{r}_X[k] = \frac{1}{N-k}\sum_{n=0}^{N-1-k} x[n]x[n+k], \quad k = 0, 1, \ldots, N-1. \tag{17.43} \]

We have also changed the $1/N$ averaging factor to $1/(N-k)$. This is because the
number of terms in the sum is only $N - k$. For example, if $N = 4$ so that we observe
$\{x[0], x[1], x[2], x[3]\}$, then (17.43) yields the estimates

\begin{align*}
\hat{r}_X[0] &= \tfrac{1}{4}\left(x^2[0] + x^2[1] + x^2[2] + x^2[3]\right) \\
\hat{r}_X[1] &= \tfrac{1}{3}\left(x[0]x[1] + x[1]x[2] + x[2]x[3]\right) \\
\hat{r}_X[2] &= \tfrac{1}{2}\left(x[0]x[2] + x[1]x[3]\right) \\
\hat{r}_X[3] &= x[0]x[3].
\end{align*}
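
The $N = 4$ case can be reproduced in a few lines. The sketch below applies (17.43) to an illustrative data vector (the values 1, 2, 3, 4 are made up for the example and do not come from the text), taking care of the shift between the zero-based index $n$ of the equations and MATLAB's one-based indexing.

```matlab
x = [1 2 3 4]';                  % illustrative data x[0],...,x[3] (not from the text)
N = length(x);
rest = zeros(N, 1);              % rest(k+1) will hold the estimate for lag k
for k = 0:N-1
    % x[n] is stored in x(n+1), so x(1:N-k) holds x[0..N-1-k]
    % and x(1+k:N) holds x[k..N-1]
    rest(k+1) = (1/(N-k)) * sum(x(1:N-k) .* x(1+k:N));
end
disp(rest')                      % estimates for lags k = 0, 1, 2, 3
```

For this data vector the four estimates are $30/4$, $20/3$, $11/2$, and $4$, matching the four expressions above term by term; the last one is a single product with no averaging.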

As $k$ increases, the distance between the samples increases and so there are fewer
products available for averaging. In fact, for $k > N-1$, we cannot estimate the value
of the ACS at all. With the estimate given in (17.43) we see that $E[\hat{r}_X[k]] = r_X[k]$
for $k = 0, 1, \ldots, N-1$. In order for the estimate to converge to the true value as
$N \to \infty$, i.e., for the random process to be ergodic in the autocorrelation or

\[ \lim_{N\to\infty}\hat{r}_X[k] = \lim_{N\to\infty}\frac{1}{N-k}\sum_{n=0}^{N-1-k} x[n]x[n+k] = r_X[k], \quad k = 0, 1, \ldots \]

we require that $\mathrm{var}(\hat{r}_X[k]) \to 0$ as $N \to \infty$. This will generally be true if $r_X[k] \to 0$
as $k \to \infty$ for a zero mean random process but see Problem 17.25 for a case where
this is not required. To illustrate the estimation performance consider the AR
random process described in Example 17.5. The true ACS and the estimated one
using (17.43) and based on the realizations shown in Figure 17.5 are shown in
Figure 17.12. The estimated ACS is shown as the dark lines while the true ACS as
given by (17.21) is shown as light lines, which are slightly displaced to the right for
easier viewing.

Figure 17.12: Estimated ACSs (dark lines) and the true ACSs given in Figure 17.6
(light lines) for the AR random process realizations shown in Figure 17.5:
(a) $a = 0.25$, $\sigma_U^2 = 1 - a^2$; (b) $a = 0.98$, $\sigma_U^2 = 1 - a^2$.

Note that in Figure 17.12 the estimated values for $k$ large exhibit
a large error. This is due to the smaller number of products, i.e., $N - k = 31 - k$,
that are available for averaging in (17.43). In the case of $k = 30$ the estimate is
$\hat{r}_X[30] = x[0]x[30]$, which as you might expect is very poor since there is no averaging
at all! Clearly, for accurate estimates of the ACS we require that $k_{\max} \ll N$. The
MATLAB code used to estimate the ACS for Figure 17.12 is given below.

n=[0:30]';N=length(n);
a1=0.25;a2=0.98;
varu1=1-a1^2;varu2=1-a2^2;
r1true=(varu1/(1-a1^2))*a1.^n; % see (17.21)
r2true=(varu2/(1-a2^2))*a2.^n;
% x1 and x2 are assumed to hold the realizations of Figure 17.5
for k=0:N-1
   r1est(k+1,1)=(1/(N-k))*sum(x1(1:N-k).*x1(1+k:N)); % (17.43) for lag k
   r2est(k+1,1)=(1/(N-k))*sum(x2(1:N-k).*x2(1+k:N));
end

To estimate the PSD requires somewhat more care than the ACS. We have
already seen that the periodogram estimate of (17.29) is not suitable. There are
many ways to estimate the PSD based on either (17.30) or (17.34). We illustrate
one approach based on (17.30). Others may be found in [Jenkins and Watts 1968,
Kay 1988]. Since we only have a segment of a single realization of the random
process, we cannot implement the expectation operation required in (17.30). Note
that the operation of $E[\cdot]$ represents an average down the ensemble or equivalently
an average over multiple realizations. To obtain some averaging, however, we can
break up the data $\{x[0], x[1], \ldots, x[N-1]\}$ into $I$ nonoverlapping blocks, with each
block having a total of $L$ samples. We assume for simplicity that there is an integer
number of blocks so that $N = IL$. The implicit assumption in doing so is that each
block exhibits the statistical characteristics of a single realization and so we can
mimic the averaging down the ensemble by averaging temporally across successive
blocks of data. Once again, the assumption of ergodicity is being employed. Thus,
we first break up the data set into the $I$ nonoverlapping data blocks

\[ y_i[n] = x[n + iL], \quad n = 0, 1, \ldots, L-1;\ i = 0, 1, \ldots, I-1 \]

where each data block has a length of $L$ samples. Then, for each data block we
compute a periodogram as

\[ \hat{P}_X^{(i)}(f) = \frac{1}{L}\left|\sum_{n=0}^{L-1} y_i[n]\exp(-j2\pi f n)\right|^2 \tag{17.44} \]

and then average all the periodograms together to yield the final PSD estimate as

\[ \hat{P}_{\mathrm{av}}(f) = \frac{1}{I}\sum_{i=0}^{I-1}\hat{P}_X^{(i)}(f). \tag{17.45} \]

This estimate is called the averaged periodogram. It can be shown that under some
conditions, $\lim_{N\to\infty}\hat{P}_{\mathrm{av}}(f) = P_X(f)$. Once again we are calling upon an ergodicity
type of property in that we are averaging the periodograms obtained in time instead
of the theoretical ensemble averaging. Of course, for convergence to hold as $N \to \infty$,
we must have $L \to \infty$ and $I \to \infty$ as well.

As an example, we examine the averaged periodogram estimates for the two AR
processes whose PSDs are shown in Figure 17.11. The number of data samples was
$N = 310$, which was broken up into $I = 10$ nonoverlapping blocks of data with
$L = 31$ samples in each one. By comparing the spectral estimates in Figure 17.13
with those of Figure 17.9, it is seen that the averaging has yielded a better estimate.
Of course, the price paid is that the data set needs to be $I = 10$ times as long!
The MATLAB code used to implement the averaged periodogram estimate is given
next. A fast Fourier transform (FFT) is used to compute the Fourier transform of
the $y_i[n]$ sequences at the frequencies $f = -0.5 + k\Delta_f$, where $k = 0, 1, \ldots, 1023$ and
$\Delta_f = 1/1024$ (see [Kay 1988] for a more detailed description).

Figure 17.13: Power spectral density estimates using the averaged periodogram
method for autoregressive processes with different parameters: (a) $a = 0.25$,
$\sigma_U^2 = 1 - a^2$; (b) $a = 0.98$, $\sigma_U^2 = 1 - a^2$. The true PSDs are shown in Figure 17.11.



Nfft=1024; % set FFT size
Pav1=zeros(Nfft,1);Pav2=Pav1; % set up arrays with desired dimension
f=[0:Nfft-1]'/Nfft-0.5; % set frequencies for later plotting
                        % of PSD estimate
for i=0:I-1
   nstart=1+i*L;nend=L+i*L; % set up beginning and end points
                            % of ith block of data
   y1=x1(nstart:nend);
   y2=x2(nstart:nend);
   % take FFT of block, since FFT outputs samples of Fourier
   % transform over frequency range [0,1), must shift FFT outputs
   % for [1/2,1) to [-1/2,0), then take complex magnitude-squared,
   % normalize by L and average
   Pav1=Pav1+(1/(I*L))*abs(fftshift(fft(y1,Nfft))).^2;
   Pav2=Pav2+(1/(I*L))*abs(fftshift(fft(y2,Nfft))).^2;
end
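
The listing above relies on `x1`, `x2`, `I`, and `L` already being defined. A self-contained variant is sketched below under stated assumptions: the data are generated here with the AR recursion $x[n] = a\,x[n-1] + u[n]$ driven by Gaussian white noise (an assumed form of the AR model, with the start-up transient ignored), and the names `avgPowerFromPSD` and `avgPowerFromData` are hypothetical. It also performs a sanity check: by Parseval's relation, the mean of the averaged periodogram over the FFT frequencies equals the temporal average of $x^2[n]$, which is an estimate of $r_X[0]$.

```matlab
I = 10;  L = 31;  N = I*L;  Nfft = 1024;   % block sizes quoted in the text
a = 0.25;                                  % AR parameter of panel (a)
u = sqrt(1 - a^2)*randn(N, 1);             % white noise with variance 1 - a^2
x = filter(1, [1 -a], u);                  % assumed AR recursion x[n] = a*x[n-1] + u[n]

Pav = zeros(Nfft, 1);
for i = 0:I-1
    y = x(1+i*L : L+i*L);                  % ith block of L samples
    % periodogram of the block as in (17.44), accumulated with weight 1/I as in (17.45)
    Pav = Pav + (1/(I*L))*abs(fftshift(fft(y, Nfft))).^2;
end
f = (0:Nfft-1)'/Nfft - 0.5;                % frequencies for plotting Pav

avgPowerFromPSD  = mean(Pav);              % Riemann sum of the estimate over [-1/2, 1/2)
avgPowerFromData = mean(x.^2);             % temporal estimate of r_X[0]
fprintf('%.6f  %.6f\n', avgPowerFromPSD, avgPowerFromData);
```

The two printed numbers agree to rounding error for any data vector, which is a convenient check that the $1/L$ normalization and the averaging over blocks have been applied correctly.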


17.8  Continuous-Time  WSS  Random  Processes 

In this section we give the corresponding definitions and formulas for continuous-time
WSS random processes. A more detailed description can be found in [Papoulis
1965]. Also, an important example is described to illustrate the use of these formulas.
A continuous-time random process $X(t)$ for $-\infty < t < \infty$ is defined to be WSS
if the mean function $\mu_X(t)$ satisfies

\[ \mu_X(t) = E[X(t)] = \mu, \quad -\infty < t < \infty \tag{17.46} \]

which is to say it is constant in time and an autocorrelation function (ACF) can be
defined as

\[ r_X(\tau) = E[X(t)X(t+\tau)], \quad -\infty < \tau < \infty \tag{17.47} \]

which is not dependent on the value of $t$. Thus, $E[X(t_1)X(t_2)]$ depends only on
$|t_2 - t_1|$. Note the use of the “parentheses” indicates that the argument of the ACF
is continuous and serves to distinguish $r_X[k]$ from $r_X(\tau)$. The ACF has the following
properties.

Property 17.13 - ACF is positive for the zero lag or $r_X(0) > 0$.

The total average power is $r_X(0) = E[X^2(t)]$.

□


Property 17.14 - ACF is an even function or $r_X(-\tau) = r_X(\tau)$.

□


Property 17.15 - Maximum value of ACF is at $\tau = 0$ or $|r_X(\tau)| \le r_X(0)$.

□


Property 17.16 - ACF measures the predictability of a random process.

The correlation coefficient for two samples of a zero mean WSS random process is

\[ \rho_{X(t),X(t+\tau)} = \frac{r_X(\tau)}{r_X(0)}. \]


Property 17.17 - ACF approaches $\mu^2$ as $\tau \to \infty$.

This assumes that the samples become uncorrelated for large lags, which is usually
the case.

□


Property 17.18 - $r_X(\tau)$ is a positive semidefinite function.

See [Papoulis 1965] for the definition of a positive semidefinite function. This
property allows for the possibility that some samples of $X(t)$ may be perfectly
predictable. If this is not the case, then the ACF is positive definite.


□

The PSD is defined as

\[ P_X(F) = \lim_{T\to\infty}\frac{1}{T}\,E\left[\left|\int_{-T/2}^{T/2} X(t)\exp(-j2\pi F t)\,dt\right|^2\right], \quad -\infty < F < \infty \tag{17.48} \]

where $F$ is the frequency in Hz. We use a capital $F$ to denote continuous-time
or analog frequency. By the Wiener-Khinchine theorem this is equivalent to the
continuous-time Fourier transform of the ACF

\begin{align}
P_X(F) &= \int_{-\infty}^{\infty} r_X(\tau)\exp(-j2\pi F\tau)\,d\tau \tag{17.49} \\
&= \int_{-\infty}^{\infty} r_X(\tau)\cos(2\pi F\tau)\,d\tau. \tag{17.50}
\end{align}

(See also Problem 17.49.) The PSD has the usual interpretation as the average
power distribution with frequency. In particular, it is the average power per Hz.
The average physical power in a frequency band $[F_1, F_2]$ is given by

\[ \text{Average physical power in } [F_1, F_2] = 2\int_{F_1}^{F_2} P_X(F)\,dF \]

where again the factor of 2 reflects the additional contribution of the negative
frequencies. The properties of the PSD are as follows:

Property 17.19 - PSD is a real function.

The PSD is given by the real function

\[ P_X(F) = \int_{-\infty}^{\infty} r_X(\tau)\cos(2\pi F\tau)\,d\tau. \]

□


Property 17.20 - PSD is nonnegative.

\[ P_X(F) \ge 0 \]

□


Property 17.21 - PSD is symmetric about $F = 0$.

\[ P_X(-F) = P_X(F) \]

□


Property 17.22 - ACF recovered from PSD using inverse Fourier transform

\begin{align}
r_X(\tau) &= \int_{-\infty}^{\infty} P_X(F)\exp(j2\pi F\tau)\,dF, \quad -\infty < \tau < \infty \tag{17.51} \\
&= \int_{-\infty}^{\infty} P_X(F)\cos(2\pi F\tau)\,dF, \quad -\infty < \tau < \infty. \tag{17.52}
\end{align}

(See also Problem 17.49.)


Unlike the PSD for a discrete-time WSS random process, the PSD for a continuous-time
WSS random process is not periodic. We next illustrate these definitions and
formulas with an example of practical importance.


Example 17.11 - Obtaining discrete-time WGN from continuous-time WGN

A common model for a continuous-time noise random process $X(t)$ in a physical
system is a WSS random process with a zero mean. In addition, due to the origin of
noise as microscopic fluctuations of a large number of electrons, or molecules, etc.,
a central limit theorem can be employed to assert that $X(t)$ is a Gaussian random
variable for all $t$. The average power of the noise in a band of frequencies is observed
to be the same for all bands up to some upper frequency limit, at which the average
power begins to decrease. For instance, consider thermal noise in a conductor due to
random fluctuations of the electrons about some mean velocity. The average power
versus frequency is predicted by physics to be constant until a cutoff frequency of
about $F_c = 1000$ GHz at room temperature [Bell Telephone Labs 1970]. Hence,
we can assume that the noise has the PSD shown in Figure 17.14 as
the true PSD. To further simplify the mathematical modeling without sacrificing
the realism of the model, we can observe that all physical systems will only pass
frequency components that are much lower than $F_c$; typically the bandwidth of
the system is $W$ Hz as shown in Figure 17.14. Any frequencies above $W$ Hz will
be cut off by the system. Therefore, the noise output of the system will be the
same whether we use the true PSD or the modeled one shown in Figure 17.14. The
modeled PSD is given by

\[ P_X(F) = \frac{N_0}{2}, \quad -\infty < F < \infty. \]

This is clearly a physically impossible PSD in that the total average power is
$r_X(0) = \int_{-\infty}^{\infty} P_X(F)\,dF = \infty$. However, its use greatly simplifies systems analysis
(see Problem 17.50). The corresponding ACF is from (17.51) the inverse Fourier
transform, which is

\[ r_X(\tau) = \frac{N_0}{2}\,\delta(\tau) \tag{17.53} \]

and is seen to be an impulse at $\tau = 0$. Again the nonphysical nature of this model
is manifest by the value $r_X(0) = \infty$. A continuous-time WSS Gaussian random
process with zero mean and the ACF given by (17.53) is called continuous-time
white Gaussian noise (WGN) (see also Example 20.6). It is a standard model in
many disciplines.

Figure 17.14: True and modeled PSDs for continuous-time white Gaussian noise.

Now as was previously mentioned, all physical systems are bandlimited to $W$ Hz,
which is typically chosen to ensure that a desired signal with a bandwidth of $W$ Hz
is not distorted. Modern signal processing hardware first bandlimits the continuous-time
waveform to a maximum of $W$ Hz using a lowpass filter and then samples the
output of the filter at the Nyquist rate of $F_s = 2W$ samples/sec. The samples are
then input into a digital computer. An important question to answer is: What are
the statistical characteristics of the noise samples that are input to the computer?
To answer this question we let $\Delta t$ be the time interval between successive samples.
Also, let $X(t)$ be the noise at the output of an ideal lowpass filter ($H(F) = 1$ for
$|F| \le W$ and $H(F) = 0$ for $|F| > W$) over the system passband shown in Figure
17.14. Then, the noise samples can be represented as

$$ X(t)\big|_{t = n\Delta t} = X[n] \qquad \text{for } -\infty < n < \infty. $$


Since $X(t)$ is bandlimited to $W$ Hz and prior to filtering had the modeled PSD
shown in Figure 17.14, its PSD is

$$ P_X(F) = \begin{cases} \dfrac{N_0}{2}, & |F| \le W \\ 0, & |F| > W. \end{cases} $$

The noise samples $X[n]$ comprise a discrete-time random process. Its characteristics
follow those of $X(t)$. Since $X(t)$ is Gaussian, then so is $X[n]$ (being just a sample).
Also, since $X(t)$ is zero mean, so is $X[n]$ for all $n$. Finally, we inquire as to whether
$X[n]$ is WSS, i.e., can we define an ACS? To answer this we first note that $X[n] =
X(n\Delta t)$ and recall that $X(t)$ is WSS. Then from the definition of the ACS

$$ \begin{aligned} E[X[n]X[n+k]] &= E[X(n\Delta t)\,X((n+k)\Delta t)] \\ &= r_X(k\Delta t) \qquad \text{(definition of continuous-time ACF)} \end{aligned} $$

which does not depend on $n$, and so $X[n]$ is a zero mean discrete-time WSS random
process with ACS

$$ r_X[k] = r_X(k\Delta t). \tag{17.54} $$

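It is seen to be a sampled version of the continuous-time ACF. To explicitly evaluate
the ACS we have from (17.51)

$$ \begin{aligned}
r_X(\tau) &= \int_{-\infty}^{\infty} P_X(F)\exp(j2\pi F\tau)\,dF \\
&= \int_{-W}^{W} \frac{N_0}{2}\exp(j2\pi F\tau)\,dF \\
&= \int_{-W}^{W} \frac{N_0}{2}\cos(2\pi F\tau)\,dF \qquad \text{(sine component is odd function)} \\
&= \frac{N_0}{2}\,\frac{\sin(2\pi F\tau)}{2\pi\tau}\bigg|_{-W}^{W} \\
&= N_0 W\,\frac{\sin(2\pi W\tau)}{2\pi W\tau}
\end{aligned} \tag{17.55} $$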

which is shown in Figure 17.15.

Figure 17.15: ACF for bandlimited continuous-time WGN with $N_0W = 1$.

Now since $r_X[k] = r_X(k\Delta t) = r_X(k/(2W))$, we see from Figure 17.15 that for
$k = \pm 1, \pm 2, \ldots$ the ACS is zero, being the result of sampling the continuous-time
ACF at its zeros. The only nonzero value is for $k = 0$, which is
$r_X[0] = r_X(0) = N_0W$ from (17.55). Therefore, we finally observe that the
ACS of the noise samples is

$$ r_X[k] = N_0 W\,\delta[k]. \tag{17.56} $$
The discrete-time noise random process is indeed WSS and has the ACS of (17.56).
The PSD corresponding to this ACS has already been found and is shown in Figure
17.10, where $\sigma^2 = N_0W$. Therefore, $X[n]$ is a discrete-time white Gaussian noise
random process. This example justifies the use of the WGN model for discrete-time
systems analysis.
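The claim that Nyquist-rate sampling lands exactly on the zeros of the ACF in (17.55) can be checked numerically. The following is a minimal Python sketch, not part of the original text; the values of `N0` and `W` and the function name `bandlimited_acf` are illustrative assumptions chosen so that $N_0W = 1$.

```python
import math

def bandlimited_acf(tau, n0, w):
    """ACF of ideal-lowpass-filtered white noise, as in (17.55):
    r_X(tau) = N0*W*sin(2*pi*W*tau)/(2*pi*W*tau), with r_X(0) = N0*W."""
    x = 2.0 * math.pi * w * tau
    if x == 0.0:
        return n0 * w
    return n0 * w * math.sin(x) / x

# Hypothetical values, chosen only for illustration (N0*W = 1).
N0 = 2.0e-3
W = 500.0
dt = 1.0 / (2.0 * W)          # Nyquist-rate sampling interval

# Sampled ACF r_X[k] = r_X(k*dt) for k = -5, ..., 5.
lags = list(range(-5, 6))
acs = [bandlimited_acf(k * dt, N0, W) for k in lags]

for k, value in zip(lags, acs):
    print(f"k = {k:+d}   r_X[k] = {value: .3e}")
```

Up to floating-point rounding, every lag other than $k = 0$ is zero and the zero lag equals $N_0W$, which is the impulse ACS of (17.56).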



Sampling  faster  gives  only  marginally  better  performance. 


It is sometimes argued that by sampling the output of a system lowpass filter whose
cutoff frequency is $W$ Hz at a rate greater than $2W$, we can improve the performance
of a signal processing system. For example, consider the estimation of the mean $\mu$
based on samples $Y[n] = \mu + X[n]$ for $n = 0, 1, \ldots, N-1$, where $E[X[n]] = 0$,
$\mathrm{var}(X[n]) = \sigma^2$, and the $X[n]$ samples are uncorrelated. The obvious estimate is
the sample mean or $(1/N)\sum_{n=0}^{N-1} Y[n]$, whose expectation is $\mu$ and whose variance
is $\sigma^2/N$. Clearly, if we could increase $N$, then the variance could be reduced and a
better estimate would result. This suggests sampling the continuous-time random
process at a rate faster than $2W$ samples/sec. The fallacy, however, is that as
the sampling rate increases, the noise samples become correlated, as can be seen by
considering a sampling rate of $4W$ for which the time interval between samples
becomes $\tau = \Delta t/2 = 1/(4W)$. Then, as observed from Figure 17.15, the correlation
between successive samples is $r_X(1/(4W)) \approx 0.6$ (from (17.55) with $N_0W = 1$, the
value is $2/\pi \approx 0.64$). In effect, by sampling faster we are not obtaining new
realizations of the noise samples but nearly repetitions of the same noise samples.
As a result, the variance will not decrease as $1/N$ but at a slower rate (see also
Problem 17.51).
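The size of the effect can be illustrated with a short Python sketch, which is not part of the original text. It assumes $N_0W = 1$ and uses the standard expression for the variance of the sample mean of a WSS process, $\mathrm{var} = (1/N)\sum_{k=-(N-1)}^{N-1}(1 - |k|/N)\,c_X[k]$, which is taken here to be the form that Problem 17.51 refers to as (17.28). The function names are illustrative.

```python
import math

def sinc_acs(k, oversampling):
    """ACS of bandlimited WGN with N0*W = 1, sampled at Fs = 2*W*oversampling.
    From (17.55): r_X[k] = sin(pi*k/oversampling) / (pi*k/oversampling)."""
    if k == 0:
        return 1.0
    x = math.pi * k / oversampling
    return math.sin(x) / x

def sample_mean_variance(n_samples, oversampling):
    """Variance of the sample mean for zero-mean noise with the ACS above
    (assumed formula: (1/N) * sum_k (1 - |k|/N) * r_X[k])."""
    total = 0.0
    for k in range(-(n_samples - 1), n_samples):
        total += (1.0 - abs(k) / n_samples) * sinc_acs(k, oversampling)
    return total / n_samples

# Same observation interval in seconds: 10 samples at 2W versus 20 samples at 4W.
var_nyquist = sample_mean_variance(10, 1)
var_fast = sample_mean_variance(20, 2)

print("Nyquist rate, N = 10:", var_nyquist)
print("Twice Nyquist, N = 20:", var_fast)
print("Uncorrelated-sample prediction for N = 20:", 1.0 / 20)
```

Under these assumptions the variance at twice the Nyquist rate is only slightly smaller than the Nyquist-rate value of $1/10$, and nowhere near the $1/20$ that uncorrelated samples would give.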



17.9  Real-World  Example  -  Random  Vibration  Testing 

Anyone  who  has  ever  traveled  in  a  jet  knows  that  upon  landing,  the  cabin  can 
vibrate  greatly.  This  is  due  to  the  air  currents  outside  the  cabin  which  interact  with 
the  metallic  aircraft  surface.  These  pressure  variations  give  rise  to  vibrations  which 
are referred to as turbulent boundary layer noise. A manufacturer that intends to
attach  an  antenna  or  other  device  to  an  aircraft  must  be  cognizant  of  this  vibration 
and  plan  for  it.  It  is  customary  then  to  subject  the  antenna  to  a  random  vibration 
test  in  the  lab  to  make  sure  it  is  not  adversely  affected  in  flight  [McConnell  1995]. 
To  do  so  the  antenna  would  be  mounted  on  a  shaker  table  and  the  table  shaken  in 
a  manner  to  simulate  the  turbulent  boundary  layer  (TBL)  noise.  The  problem  the 
manufacturer  faces  is  how  to  provide  the  proper  vibration  signal  to  the  table,  which 
presumably will then be transmitted to the antenna. We now outline a possible
solution  to  this  problem. 

The  National  Aeronautics  and  Space  Administration  (NASA)  has  determined 
PSD  models  for  the  TBL  noise  through  physical  modeling  and  experimentation. 
A  reasonable  model  for  the  one-sided  PSD  of  TBL  noise  upon  reentry  of  a  space 
vehicle,  such  as  the  space  shuttle,  into  the  earth’s  atmosphere  is  given  by  [NASA 
2001] 


$$ G_X(F) = \begin{cases} G_X(500), & 0 \le F < 500 \text{ Hz} \\ \;\ldots, & 500 \le F \le 50000 \text{ Hz} \end{cases} $$

where $r$ represents a reference value, which is 20 μPa. A μPa is a unit of pressure
equal to $10^{-6}$ N/m$^2$. This PSD is shown in Figure 17.16 referenced to the standard
unit so that $r = 1$. Note that it has a lowpass type of characteristic.

Figure 17.16: Continuous-time one-sided PSD for TBL noise.

In order to provide a signal to the shaker table that is random and has the PSD shown
in Figure 17.16, we will assume that the signal is produced in a digital computer
and then converted via a digital-to-analog converter to a continuous-time signal.
Hence, we need to produce a discrete-time WSS random process within the computer
that has the proper PSD. Recalling our discussion in Section 17.8 we know that
$r_X[k] = r_X(k\Delta t)$ and since the highest frequency in the PSD is $W = 50{,}000$ Hz, we
choose $\Delta t = 1/(2W) = 1/100{,}000$. This produces the discrete-time PSD shown in
Figure 17.17, which is given by $P_X(f) = (1/(2\Delta t))\,G_X(f/\Delta t)$. (We have divided by two
to obtain the usual two-sided PSD. Also, the sampling operation introduces a factor
of $1/\Delta t$ [Jackson 1991].) To generate a realization of a discrete-time WSS random
process with the PSD given in Figure 17.17 we will use the AR model introduced in
Example 17.5.

Figure 17.17: Discrete-time PSD for TBL noise (horizontal axis: $f$ in cycles/sample, $-0.5 \le f \le 0.5$).

From the ACS we can determine values of $a$ and $\sigma_U^2$ if we know $r_X[0]$
and $r_X[1]$ since

$$ a = \frac{r_X[1]}{r_X[0]} \tag{17.57} $$

$$ \sigma_U^2 = r_X[0]\,(1 - a^2). \tag{17.58} $$

Knowing $a$ and $\sigma_U^2$ will allow us to use the defining recursive difference equation,
$X[n] = aX[n-1] + U[n]$, of an AR random process to generate the realization. To
obtain the first two lags of the ACS we use (17.39)

$$ \begin{aligned} r_X[0] &= \int_{-1/2}^{1/2} P_X(f)\,df \\ r_X[1] &= \int_{-1/2}^{1/2} P_X(f)\cos(2\pi f)\,df \end{aligned} $$

where $P_X(f)$ is given in Figure 17.17. These can be evaluated numerically by
replacing the integrals with approximating sums to yield $r_X[0] = 1.5169 \times 10^{15}$ and
$r_X[1] = 4.8483 \times 10^{14}$. Then, using (17.57) and (17.58), we have the AR parameters
$a = 0.3196$ and $\sigma_U^2 = 1.362 \times 10^{15}$. With these parameters the AR PSD (see
(17.36)) and the true PSD (shown in Figure 17.17) are plotted in Figure 17.18. The
agreement between them is fairly good except near $f = 0$. Hence, with these values
of the parameters a random process realization could be synthesized within a digital
computer and then converted to analog form to drive the shaker table.
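The fitting steps described above (approximate the two lag integrals by sums, then apply (17.57) and (17.58)) can be sketched in Python. This is an illustrative sketch, not the author's code. Because the TBL PSD of Figure 17.17 is not reproduced here, the procedure is checked on a hypothetical AR(1) PSD with $a = 0.5$ and $\sigma_U^2 = 1$, using the standard form $\sigma_U^2/|1 - a\exp(-j2\pi f)|^2$ (assumed to be what (17.36) denotes). All function names are assumptions. The last line applies (17.57) and (17.58) to the lag values quoted in the text.

```python
import math

def acs_lags_from_psd(psd, num_points=4096):
    """Approximate r_X[0] and r_X[1] by midpoint sums over -1/2 <= f <= 1/2."""
    df = 1.0 / num_points
    r0 = 0.0
    r1 = 0.0
    for i in range(num_points):
        f = -0.5 + (i + 0.5) * df
        p = psd(f)
        r0 += p * df
        r1 += p * math.cos(2.0 * math.pi * f) * df
    return r0, r1

def fit_ar1(r0, r1):
    """AR(1) parameters from the first two ACS lags, per (17.57) and (17.58)."""
    a = r1 / r0
    var_u = r0 * (1.0 - a * a)
    return a, var_u

def ar1_psd(f, a=0.5, var_u=1.0):
    """Hypothetical test PSD: |1 - a*exp(-j*2*pi*f)|^2 = 1 - 2*a*cos(2*pi*f) + a^2."""
    return var_u / (1.0 - 2.0 * a * math.cos(2.0 * math.pi * f) + a * a)

# Sanity check: fitting an exact AR(1) PSD should return its own parameters.
r0, r1 = acs_lags_from_psd(ar1_psd)
a_hat, var_u_hat = fit_ar1(r0, r1)
print("recovered a, var_u:", a_hat, var_u_hat)

# Lag values quoted in the text for the TBL noise PSD.
a_tbl, var_u_tbl = fit_ar1(1.5169e15, 4.8483e14)
print("TBL fit a, var_u:", a_tbl, var_u_tbl)
```

With `a_tbl` and `var_u_tbl` in hand, a realization would follow from the recursion $X[n] = aX[n-1] + U[n]$ driven by zero mean white noise of variance $\sigma_U^2$.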
Figure 17.18: Discrete-time PSD and its AR PSD model for TBL noise. The true
PSD is shown as the dashed line and the AR PSD model as the solid line.

Problems

17.1 (0) (w) A Bernoulli random process $X[n]$ for $-\infty < n < \infty$ consists of
independent random variables with each random variable taking on the values
$+1$ and $-1$ with probabilities $p$ and $1-p$, respectively. Is this random process
WSS? If it is WSS, find its mean sequence and autocorrelation sequence.

17.2 (w) Consider the random process defined as $X[n] = a_0U[n] + a_1U[n-1]$ for
$-\infty < n < \infty$, where $a_0$ and $a_1$ are constants, and $U[n]$ is an IID random
process with each $U[n]$ having a mean of zero and a variance of one. Is this
random process WSS? If it is WSS, find its mean sequence and autocorrelation
sequence.

17.3 (w) A sinusoidal random process is defined as $X[n] = A\cos(2\pi f_0 n)$ for
$-\infty < n < \infty$, where $0 < f_0 < 0.5$ is a discrete-time frequency, and $A \sim \mathcal{N}(0, 1)$.
Is this random process WSS? If it is WSS, find its mean sequence and
autocorrelation sequence.

17.4 (f) A WSS random process has $E[X[0]] = 1$ and a covariance sequence
$c_X[n_1, n_2] = 2\delta[n_2 - n_1]$. Find the ACS and plot it.

17.5 (^) (w) A random process $X[n]$ for $-\infty < n < \infty$ consists of independent
random variables with

$$ X[n] \sim \begin{cases} \mathcal{N}(0, 1) & \text{for } n \text{ even} \\ \mathcal{U}(-\sqrt{3}, \sqrt{3}) & \text{for } n \text{ odd.} \end{cases} $$

Is this random process WSS? Is it stationary?
17.6 (w) The random processes $X[n]$ and $Y[n]$ are both WSS. Every sample of
$X[n]$ is independent of every sample of $Y[n]$. Is $Z[n] = X[n] + Y[n]$ WSS? If
it is WSS, find its mean sequence and autocorrelation sequence.

17.7 (w) The random processes $X[n]$ and $Y[n]$ are both WSS. Every sample of
$X[n]$ is independent of every sample of $Y[n]$. Is $Z[n] = X[n]Y[n]$ WSS? If it
is WSS, find its mean sequence and autocorrelation sequence.

17.8 (f) For the ACS $r_X[k] = (1/2)^{k}$ for $k \ge 0$ and $r_X[k] = (1/2)^{-k}$ for $k < 0$,
verify that Properties 17.1-17.3 are satisfied.

17.9 (o) (w) For the sequence $r_X[k] = ab^{|k|}$ for $-\infty < k < \infty$, determine the
values of $a$ and $b$ that will result in a valid ACS.

17.10 (w) A periodic WSS random process with period $P$ is defined to be a random
process $X[n]$ whose ACS satisfies $r_X[k + P] = r_X[k]$ for all $k$. An example
is the randomly phased sinusoid of Example 17.10 for which $P = 10$. Show
that the correlation coefficient for two samples of a zero mean periodic random
process that are separated by $P$ samples is one. Comment on the predictability
of $X[n + P]$ based on $X[n] = x[n]$.

17.11 (w) A WSS random process has an ACS $r_X[k]$ and mean $\mu$. Find the
correlation coefficient for two samples of the random process that are separated by
$k$ samples.

17.12  (^)  (w)  Which  of  the  sequences  in  Figure  17.19  cannot  be  valid  ACSs?  If 
the  sequence  cannot  be  an  ACS,  explain  why  not. 

17.13 (w) For the randomly phased sinusoid described in Example 17.4 find the
optimal linear prediction of $X[1]$ based on observing $X[0] = x[0]$, and also of
$X[10]$ based on observing $X[0] = x[0]$. Can either of these samples be perfectly
predicted? Explain why or why not.

17.14 (w) For the AR random process described in Example 17.10 find the optimal
linear prediction of $X[n_0 + k_0]$ based on observing $X[n_0] = x[n_0]$. How accurate
is your prediction in terms of MSE as $k_0$ increases?

17.15 (t) In this problem we derive $r_X[0]$ for the AR random process described in
Example 17.5. To do so assume that $X[n]$ can be written as

$$ X[n] = \sum_{k=0}^{\infty} a^k U[n-k]. \tag{17.59} $$

This was shown to be true in Example 17.5. Then verify that $r_X[0]$ can be
written as

$$ r_X[0] = \sum_{k=0}^{\infty}\sum_{l=0}^{\infty} a^k a^l E[U[n-k]U[n-l]] $$

and use the properties of the $U[n]$ random process to finish the derivation.

Figure 17.19: Possible ACSs for Problem 17.12.

17.16 (t) Using a similar approach to the one used in Problem 17.15 derive the
ACS for the AR random process described in Example 17.5. Hint: Start with
the definition of the ACS and use (17.59).

17.17 (^) (w) To generate a realization of an AR process on the computer we
can use the recursive difference equation $X[n] = aX[n-1] + U[n]$ for $n \ge 0$.
However, in doing so, we soon realize that the initial condition $X[-1]$
is required. Assume that we set $X[-1] = 0$ and use the recursion $X[0] = U[0]$,
$X[1] = aX[0] + U[1], \ldots$. Determine the mean and variance of $X[n]$ for
$n \ge 0$, where as usual $U[n]$ consists of uncorrelated random variables with
zero mean and variance $\sigma_U^2$. Does the mean depend on $n$? Does the variance
depend on $n$? What happens as $n \to \infty$? Hint: First show that $X[n]$ can be
written as $X[n] = \sum_{k=0}^{n} a^k U[n-k]$ for $n \ge 0$.

17.18 (w) This problem continues Problem 17.17. Instead of letting $X[-1] = 0$, set
$X[-1]$ equal to a random variable with mean 0 and a variance of $\sigma_U^2/(1 - a^2)$
and that is uncorrelated with $U[n]$ for $n \ge 0$. Find the mean and variance of
$X[0]$. Explain your results and why this makes sense.

17.19 (^) (w) An example of a sequence that is not positive semidefinite is $r[0] = 1$,
$r[-1] = r[1] = -7/8$ and equals zero otherwise. Compute the determinant
of the $1 \times 1$ principal minor, the $2 \times 2$ principal minor, and the $3 \times 3$ principal
minor of the $3 \times 3$ autocorrelation matrix $\mathbf{R}_X$ using these values. Also, plot
the discrete-time Fourier transform of $r[k]$. Why do you think the positive
semidefinite property is important?

17.20  (^)  (w)  For  the  general  MA  random  process  of  Example  17.6  show  that  the 
process  is  WSS. 

17.21  (f)  Use  (17.28)  to  show  that  the  MA  random  process  defined  in  Example 
17.6  is  ergodic  in  the  mean. 

17.22  (t,f)  Show  that  a  WSS  random  process  whose  ACS  satisfies  rx[k]  =  for 
k  >  ko  >  0  must  be  ergodic  in  the  mean. 

17.23 (t) Prove (17.28) by using the relationship

$$ \sum_{i=0}^{N-1}\sum_{j=0}^{N-1} g[i-j] = \sum_{k=-(N-1)}^{N-1} (N - |k|)\,g[k]. $$

Try verifying this relationship for $N = 3$.

17.24 (f) For the random DC level defined in Example 17.7 prove that $\mathrm{var}(\hat{\mu}_N) = 1$.

17.25 (f) Explain why the randomly phased sinusoid defined in Example 17.4 is
ergodic in the mean. Next show that it is ergodic in the ACS in that

$$ \lim_{N\to\infty} \hat{r}_X[k] = \lim_{N\to\infty} \frac{1}{N-k}\sum_{n=0}^{N-1-k} X[n]X[n+k] = \frac{1}{2}\cos(2\pi(0.1)k) = r_X[k] \qquad k \ge 0 $$

by computing $\hat{r}_X[k]$ directly. Hint: Use the fact that
$\lim_{N\to\infty} (1/(N-k))\sum_{n=0}^{N-1-k} \cos(2\pi f n + \phi) = 0$ for any $0 < f < 1$ and any
phase angle $\phi$. This is because the temporal average of an infinite duration
sinusoid is zero.

17.26 (t) Show that the formula

$$ \sum_{m=-M}^{M}\sum_{n=-M}^{M} g[m-n] = \sum_{k=-2M}^{2M} (2M + 1 - |k|)\,g[k] $$

is true for $M = 1$.
17.27 (t) Argue that

$$ \lim_{M\to\infty} \sum_{k=-2M}^{2M} \underbrace{\left(1 - \frac{|k|}{2M+1}\right)}_{w[k]} r_X[k]\exp(-j2\pi fk) = \sum_{k=-\infty}^{\infty} r_X[k]\exp(-j2\pi fk) $$

by drawing pictures of $r_X[k]$, which decays to zero, and overlaying it with $w[k]$
as $M$ increases.

17.28 (^) (w) For the differenced random process defined in Example 17.1 determine
the PSD. Explain your results.

17.29 (f) Determine the PSD for the randomly phased sinusoid described in Example
17.4. Is this result reasonable? Hint: The discrete-time Fourier transform
of $\exp(j2\pi f_0 n)$ for $-1/2 < f_0 < 1/2$ is $\delta(f - f_0)$ over the frequency interval
$-1/2 \le f \le 1/2$.

17.30 (^) (w) A random process is defined as $X[n] = AU[n]$, where $A \sim \mathcal{N}(0, \sigma_A^2)$
and $U[n]$ is white noise with variance $\sigma_U^2$. The random variable $A$ is independent
of all the samples of $U[n]$. Determine the PSD of $X[n]$.

17.31 (w) Find the PSD of the random process $X[n] = (1/2)^{|n|}U[n]$ for $-\infty < n < \infty$,
where $U[n]$ is white noise with variance $\sigma_U^2$.

17.32 (w) Find the PSD of the random process $X[n] = a_0U[n] + a_1U[n-1]$, where
$a_0, a_1$ are constants and $U[n]$ is white noise with variance $\sigma_U^2 = 1$.

17.33 (w) A Bernoulli random process consists of IID Bernoulli random variables
taking on values $+1$ and $-1$ with equal probabilities. Determine the PSD and
explain your results.

17.34 (^) (w) A random process is defined as $X[n] = U[n] + \mu$ for $-\infty < n < \infty$,
where $U[n]$ is white noise with variance $\sigma_U^2$. Find the ACS and PSD and plot
your results.

17.35 (w,c) Consider the AR random process defined in Example 17.5 and described
further in Example 17.10 with $-1 < a < 0$ and for some $\sigma_U^2 > 0$. Plot
the PSD for several values of $a$ and explain your results.

17.36  (f,c)  Plot  the  corresponding  PSD  for  the  ACS 


rx[k]  =  < 


k  =  0 
k  =  ±  1 
k  =  ±2 
otherwise. 




17.37 (w) If a random process has the PSD $P_X(f) = 1 + \cos(2\pi f)$, are the samples
of the random process uncorrelated?

17.38 (o) (f) If a random process has the PSD
$P_X(f) = |1 + \exp(-j2\pi f) + (1/2)\exp(-j4\pi f)|^2$, determine the ACS.

17.39 (c) For the AR random processes whose ACSs are shown in Figure 17.6
generate a realization of $N = 2000$ samples for each process. Use the MATLAB
code segment given in Section 17.4 to do this. Then, estimate the ACS for
$k = 0, 1, \ldots, 30$ and plot the results. Compare your results to those shown in
Figure 17.12 and explain.

17.40 (^) (w) A PSD is given as $P_X(f) = a + b\cos(2\pi f)$ for some constants $a$ and
$b$. What values of $a$ and $b$ will result in a valid PSD?

17.41 (f) A PSD is given as

$$ P_X(f) = \begin{cases} 2 - 8f, & 0 \le f \le 1/4 \\ 0, & 1/4 < f \le 1/2. \end{cases} $$

Plot the PSD and find the total average power in the random process.

17.42 (^) (c) Plot 50 realizations of the randomly phased sinusoid described in
Example 17.4 with $N = 50$, and overlay the samples in a scatter diagram plot
such as shown in Figure 16.15. Explain the results by referring to the PDF of
Figure 16.12. Next estimate the following quantities: $E[X[10]]$, $E[X[12]]$,
$E[X[10]X[12]]$, $E[X[12]X[14]]$ by averaging down the ensemble, and compare
your simulated results to the theoretical values.

17.43 (c) In this problem we support the results of Problem 17.18 by using a computer
simulation. Specifically, generate $M = 10{,}000$ realizations of the AR
random process $X[n] = 0.95X[n-1] + U[n]$ for $n = 0, 1, \ldots, 49$, where $U[n]$ is
WGN with $\sigma_U^2 = 1$. Do so two ways: for the first set of realizations let $X[-1] = 0$
and for the second set of realizations let $X[-1] \sim \mathcal{N}(0, \sigma_U^2/(1 - a^2))$, using a
different random variable for each realization. Now estimate the variance for
each sample time $n$, which is $r_X[0]$, by averaging $X^2[n]$ down the ensemble of
realizations. Do you obtain the theoretical result of $r_X[0] = \sigma_U^2/(1 - a^2)$?

17.44 (^) (c) Generate a realization of discrete-time white Gaussian noise with
variance $\sigma^2 = 1$. For $N = 64$, $N = 128$, and $N = 256$, plot the periodogram.
What is the true PSD? Does your estimated PSD get closer to the true PSD
as $N$ increases? If not, how could you improve your estimate?

17.45 (c) Generate a realization of an AR random process of length $N = 31{,}000$
with $a = 0.25$ and $\sigma_U^2 = 1 - a^2$. Break up the data set into 1000 nonoverlapping
blocks of data and compute the periodogram for each block. Finally, average
the periodograms together for each point in frequency to determine the final
averaged periodogram estimate. Compare your results to the theoretical PSD
shown in Figure 17.11a.

17.46 (f) A continuous-time randomly phased sinusoid is defined by $X(t) =
\cos(2\pi F_0 t + \Theta)$, where $\Theta \sim \mathcal{U}(0, 2\pi)$. Determine the mean function and ACF
for this random process.

17.47 (^) (f) For the PSD $P_X(F) = \exp(-|F|)$, determine the average power in
the band [10, 100] Hz.

17.48 (w) If a PSD is given as $P_X(F) = \exp(-|F/F_0|)$, what happens to the ACF
as $F_0$ increases and also as $F_0 \to \infty$?

17.49 (t) Based on (17.49) derive (17.50), and also based on (17.51) derive (17.52).

17.50 (0) (w) A continuous-time white noise random process $U(t)$ whose PSD is
given as $P_U(F) = N_0/2$ is integrated to form the continuous-time MA random
process

$$ X(t) = \frac{1}{T}\int_{t-T}^{t} U(\xi)\,d\xi. $$

Determine the mean function and the variance of $X(t)$. Does $X(t)$ have infinite
total average power?

17.51 (^) (w,c) Consider a continuous-time random process $X(t) = \mu + U(t)$,
where $U(t)$ is zero mean and has the ACF given in Figure 17.15. If $X(t)$ is
sampled at twice the Nyquist rate, which is $F_s = 4W$, determine the ACS of
$X[n]$. Next using (17.28) find the variance of the sample mean estimator

$$ \hat{\mu}_N = \frac{1}{N}\sum_{n=0}^{N-1} X[n] $$

for $N = 20$. Is it half of the variance of the sample mean estimator if we had
sampled at the Nyquist rate and used $N = 10$ samples in our estimate? Note
that in either case the total length of the data interval in seconds is the same,
which is $20/(4W) = 10/(2W)$.

17.52  (f)  A  PSD  is  given  as 


Px(f)  = 


1  +  \  exp(-j2ivf) 


Model  this  PSD  by  using  an  AR  PSD  as  was  done  in  Section  17.9.  Plot  the 
true  PSD  and  the  AR  model  PSD. 


Chapter  18 


Linear  Systems  and  Wide  Sense 
Stationary  Random  Processes 

18.1  Introduction 

Most  physical  systems  are  conveniently  modeled  by  a  linear  system.  These  include 
electrical  circuits,  mechanical  machines,  human  biological  functions,  and  chemical 
reactions,  just  to  name  a  few.  When  the  system  is  capable  of  responding  to  a 
continuous-time  input,  its  effect  can  be  described  using  a  linear  differential  equation. 
For  a  system  that  responds  to  a  discrete-time  input  a  linear  difference  equation 
can  be  used  to  characterize  the  effect  of  the  system.  Furthermore,  for  systems 
whose  characteristics  do  not  change  with  time,  the  coefficients  of  the  differential  or 
difference  equation  are  constants.  Such  a  system  is  termed  a  linear  time  invariant 
(LTI)  system  for  continuous-time  inputs/outputs  and  a  linear  shift  invariant  (LSI) 
system for discrete-time inputs/outputs. In this chapter we explore the effect of these
systems  on  wide  sense  stationary  (WSS)  random  process  inputs.  The  reader  who  is 
unfamiliar  with  the  basic  concepts  of  linear  systems  should  first  read  Appendix  D  for 
a  brief  introduction.  Many  excellent  books  are  available  to  supplement  this  material 
[Jackson  1991,  Oppenheim,  Willsky,  and  Nawab  1997,  Poularikas  and  Seely  1985]. 
We  will  now  consider  only  discrete-time  systems  and  discrete-time  WSS  random 
processes.  A  summary  of  the  analogous  concepts  for  the  continuous-time  case  is 
given  in  Section  18.6. 

The  importance  of  LSI  systems  is  that  they  maintain  the  wide  sense  stationarity 
of  the  random  process.  That  is  to  say,  if  the  input  to  an  LSI  system  is  a  WSS 
random  process,  then  the  output  is  also  a  WSS  random  process.  The  mean  and  ACS, 
or  equivalently  the  PSD,  however,  are  modified  by  the  action  of  the  system.  We  will 
be  able  to  obtain  simple  formulas  yielding  these  quantities  at  the  system  output.  In 
effect,  the  linear  system  modifies  the  first  two  moments  of  the  random  process  but 
in  an  easily  determined  and  intuitively  pleasing  way.  This  allows  us  to  assess  the 
effect  of  a  linear  system  on  a  WSS  random  process  and  therefore  provides  a  means 
to produce a WSS random process at the output with some desired characteristics.
Furthermore,  the  theory  is  easily  extended  to  the  case  of  multiple  random  processes 
and  multiple  linear  systems  as  we  will  see  in  the  next  chapter. 

18.2  Summary 

For  the  linear  shift  invariant  system  shown  in  Figure  18.1  the  output  random  process 
is  given  by  (18.2).  If  the  input  random  process  is  WSS,  then  the  output  random 
process  is  also  WSS.  The  output  random  process  has  a  mean  given  by  (18.9),  an  ACS 
given  by  (18.10),  and  a  PSD  given  by  (18.11).  If  the  input  WSS  random  process 
is  white  noise,  then  the  output  random  process  has  the  ACS  of  (18.15).  In  Section 
18.4  the  PSD  is  interpreted,  using  the  results  of  Theorem  18.3.1,  as  the  average 
power  in  a  narrow  frequency  band  divided  by  the  width  of  the  frequency  band.  The 
application  of  discrete-time  linear  systems  to  estimation  of  samples  of  a  random 
process  is  explored  in  Section  18.5.  Generically  known  as  Wiener  filtering,  there  are 
four  separate  problems  defined,  of  which  the  smoothing  and  prediction  problems 
are  solved.  For  smoothing  of  a  random  process  signal  in  noise  the  estimate  is  given 
by  (18.20)  and  the  optimal  filter  has  the  frequency  response  of  (18.25).  A  specific 
application  is  given  in  Example  18.4  to  estimation  of  an  AR  signal  that  has  been 
corrupted  by  white  noise.  The  minimum  MSE  of  the  optimal  Wiener  smoother 
is  given  by  (18.27).  One-step  linear  prediction  of  a  random  process  sample  based 
on  the  current  and  all  past  samples  as  given  by  (18.21)  leads  to  the  optimal  filter 
impulse  response  satisfying  the  infinite  set  of  linear  equations  of  (18.28).  The  general 
solution  is  summarized  in  Section  18.5.2  and  then  illustrated  in  Example  18.6.  For 
linear  prediction  based  on  the  current  sample  and  a  finite  number  of  past  samples 
the  optimal  impulse  response  is  given  by  the  solution  of  the  Wiener-Hopf  equations 
of  (18.36).  The  corresponding  minimum  MSE  is  given  by  (18.37).  In  particular,  if 
the  random  process  is  an  AR  random  process  of  order  p,  the  Wiener-Hopf  equations 
are the same as the Yule-Walker equations of (18.38) and the minimum mean square
error  equation  of  (18.37)  is  the  same  as  for  the  white  noise  variance  of  (18.39).  In 
Section  18.6  the  corresponding  formulas  for  a  continuous-time  random  process  that 
is  input  to  a  linear  time  invariant  system  are  summarized.  The  mean  at  the  output 
is  given  by  (18.40),  the  ACF  is  given  by  (18.41),  and  the  PSD  is  given  by  (18.42). 
Example  18.7  illustrates  the  use  of  these  formulas.  In  Section  18.7  the  application 
of  AR  random  process  modeling  to  speech  synthesis  is  described.  In  particular,  it 
is  shown  how  a  segment  of  speech  can  first  be  modeled,  and  then  how  for  an  actual 
segment  of  speech,  the  parameters  of  the  model  can  be  extracted.  The  model  with 
its  estimated  parameters  can  then  be  used  for  speech  synthesis. 

18.3  Random  Process  at  Output  of  Linear  System 

We wish to consider the effect of an LSI system on a discrete-time WSS random process. We will from time to time refer to the linear system as a filter, a term that is synonymous. In Section 18.6 we summarize the results for a continuous-time WSS random process that is input to an LTI system. To proceed, let U[n] be the WSS random process input and X[n] be the random process output of the system. We generally represent an LSI system schematically with its input and output as shown in Figure 18.1. Previously, in Chapters 16 and 17 we have seen several examples of LSI systems with WSS random process inputs.

[Block diagram: input U[n], n = ..., -1, 0, 1, ... -> linear shift invariant system -> output X[n], n = ..., -1, 0, 1, ...]

Figure 18.1: Linear shift invariant system with random process input and output.

One example is the MA random process (see Example 16.7) for which X[n] = (U[n] + U[n-1])/2, with U[n] a white Gaussian noise process with variance $\sigma_U^2$. (Recall that discrete-time white noise is a zero mean WSS random process with ACS $r_U[k] = \sigma_U^2 \delta[k]$.) We may view the MA random process as the output X[n] of an LSI filter excited at the input by the white Gaussian noise random process U[n]. (In this chapter we will be considering only the first two moments of X[n]. That U[n] is a random process consisting of Gaussian random variables is of no consequence to these discussions. The same results are obtained for any white noise random process U[n] regardless of the marginal PDFs. In Chapter 20, however, we will consider the joint PDF of samples of X[n], and in that case, the fact that U[n] is white Gaussian noise will be very important.) The averaging operation can be thought of as a filtering by the LSI filter having an impulse response


\[
h[k] = \begin{cases} \frac{1}{2} & k = 0 \\ \frac{1}{2} & k = 1 \\ 0 & \text{otherwise.} \end{cases} \tag{18.1}
\]


(Recall that the impulse response h[n] is the output of the LSI system when the input u[n] is a unit impulse $\delta[n]$.) This is because the output of an LSI filter is obtained using the convolution sum formula

\[
X[n] = \sum_{k=-\infty}^{\infty} h[k] U[n-k] \tag{18.2}
\]

so that upon using (18.1) in (18.2) we have

\[
\begin{aligned}
X[n] &= h[0]U[n] + h[1]U[n-1] \\
&= \tfrac{1}{2} U[n] + \tfrac{1}{2} U[n-1] \\
&= \tfrac{1}{2}\left(U[n] + U[n-1]\right).
\end{aligned}
\]

In general, the LSI system will be specified by giving its impulse response h[k] for $-\infty < k < \infty$ or equivalently by giving its system function, which is defined as the z-transform of the impulse response. The system function is thus given by

\[
\mathcal{H}(z) = \sum_{k=-\infty}^{\infty} h[k] z^{-k}. \tag{18.3}
\]

In addition, we will have need for the frequency response of the LSI system, which is defined as the discrete-time Fourier transform of the impulse response. It is therefore given by

\[
H(f) = \sum_{k=-\infty}^{\infty} h[k] \exp(-j2\pi f k). \tag{18.4}
\]

This function assesses the effect of the system on a complex sinusoidal input sequence $u[n] = \exp(j2\pi f_0 n)$ for $-\infty < n < \infty$. It can be shown that the response of the system to this input is $x[n] = H(f_0)\exp(j2\pi f_0 n) = H(f_0)u[n]$ (use (18.2) with the deterministic input $u[n] = \exp(j2\pi f_0 n)$). Hence, its name derives from the fact that the system action is to modify the amplitude of the complex sinusoid by $|H(f_0)|$ and the phase of the complex sinusoid by $\angle H(f_0)$, but otherwise retains the complex sinusoidal sequence. It should also be noted that the frequency response is easily obtained from the system function as $H(f) = \mathcal{H}(\exp(j2\pi f))$. For the MA random process we have upon using (18.1) in (18.3) that the system function is

\[
\mathcal{H}(z) = \frac{1}{2} + \frac{1}{2} z^{-1}
\]

and the frequency response is the system function when z is replaced by $\exp(j2\pi f)$, yielding

\[
H(f) = \frac{1}{2} + \frac{1}{2}\exp(-j2\pi f).
\]

It is said that the system function has been evaluated "on the unit circle in the z-plane".
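The response to a complex sinusoid can be checked numerically. The following is a minimal Python sketch, not part of the original text: it evaluates (18.4) for the MA filter of (18.1) and confirms that the convolution sum (18.2) returns $H(f_0)u[n]$. The function names `freq_response` and `filter_output` and the frequency `f0 = 0.1` are illustrative assumptions.

```python
import cmath
import math


def freq_response(h, f):
    """H(f) = sum_k h[k] exp(-j 2 pi f k) for a causal FIR filter, as in (18.4).

    h holds h[0], h[1], ...; f is in cycles per sample.
    """
    return sum(hk * cmath.exp(-2j * math.pi * f * k) for k, hk in enumerate(h))


def filter_output(h, u, n):
    """Convolution sum (18.2) for a causal FIR filter; u is a function of n."""
    return sum(hk * u(n - k) for k, hk in enumerate(h))


h_ma = [0.5, 0.5]  # impulse response (18.1)
f0 = 0.1           # arbitrary illustrative frequency (assumption)


def u(n):
    """Complex sinusoidal input exp(j 2 pi f0 n)."""
    return cmath.exp(2j * math.pi * f0 * n)


H_f0 = freq_response(h_ma, f0)
closed_form = 0.5 + 0.5 * cmath.exp(-2j * math.pi * f0)
x5 = filter_output(h_ma, u, 5)  # output sample at n = 5

print(abs(H_f0 - closed_form))       # difference from the closed-form H(f)
print(abs(x5 - H_f0 * u(5)))         # difference between x[5] and H(f0) u[5]
print(abs(H_f0), cmath.phase(H_f0))  # amplitude scaling and phase shift
```

Both printed differences should be at the level of floating-point rounding, which is the numerical counterpart of the statement that the filter only scales and phase-shifts the sinusoid.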

We  next  give  an  example  to  determine  the  characteristics  of  the  output  random 
process  of  an  LSI  system  with  a  WSS  input  random  process.  The  previous  example 
is  generalized  slightly  to  prepare  for  the  theorem  to  follow. 

Example  18.1  -  Output  random  process  characteristics 

Let U[n] be a WSS random process with mean $\mu_U$ and ACS $r_U[k]$. This random process is input to an LSI system with impulse response

\[
h[k] = \begin{cases} h[0] & k = 0 \\ h[1] & k = 1 \\ 0 & \text{otherwise.} \end{cases}
\]


This  linear  system  is  called  a  finite  impulse  response  (FIR)  filter  since  its  impulse 
response  has  only  a  finite  number  of  nonzero  samples.  We  wish  to  determine  if 
a.  the  output  random  process  is  WSS  and  if  so

b.  its  mean  sequence  and  ACS. 

The output of the linear system is from (18.2)

\[
X[n] = h[0]U[n] + h[1]U[n-1].
\]

The mean sequence is found as

\[
\begin{aligned}
E[X[n]] &= h[0]E[U[n]] + h[1]E[U[n-1]] \\
&= h[0]\mu_U + h[1]\mu_U \\
&= (h[0] + h[1])\mu_U
\end{aligned}
\]

so that the mean is constant with time and is given by

\[
\mu_X = (h[0] + h[1])\mu_U.
\]

It can also be written from (18.4) as

\[
\mu_X = \left.\sum_{k=-\infty}^{\infty} h[k]\exp(-j2\pi f k)\right|_{f=0} \mu_U = H(0)\mu_U.
\]

The mean at the output of the LSI system is seen to be modified by the frequency response evaluated at f = 0. Does this seem reasonable? Next, if E[X[n]X[n+k]] is found not to depend on n, we will be able to conclude that X[n] is WSS. Continuing we have

\[
\begin{aligned}
E[X[n]X[n+k]] &= E[(h[0]U[n] + h[1]U[n-1])(h[0]U[n+k] + h[1]U[n+k-1])] \\
&= h^2[0]E[U[n]U[n+k]] + h[0]h[1]E[U[n]U[n+k-1]] \\
&\quad + h[1]h[0]E[U[n-1]U[n+k]] + h^2[1]E[U[n-1]U[n+k-1]] \\
&= (h^2[0] + h^2[1])r_U[k] + h[0]h[1]r_U[k-1] + h[1]h[0]r_U[k+1]
\end{aligned}
\]

and is seen not to depend on n. Hence, X[n] is WSS and its ACS is

\[
r_X[k] = (h^2[0] + h^2[1])r_U[k] + h[0]h[1]r_U[k-1] + h[1]h[0]r_U[k+1]. \tag{18.5}
\]

❖ 

Using the previous example for sake of illustration, we next show that the ACS of the output random process of an LSI system can be written as a multiple convolution of sequences. To do so consider (18.5) and let

\[
\begin{aligned}
g[0] &= h^2[0] + h^2[1] \\
g[1] &= h[0]h[1] \\
g[-1] &= h[1]h[0]
\end{aligned}
\]

and zero otherwise. Then

\[
\begin{aligned}
r_X[k] &= g[0]r_U[k] + g[1]r_U[k-1] + g[-1]r_U[k+1] \\
&= \sum_{j=-1}^{1} g[j] r_U[k-j] \\
&= g[k] \star r_U[k] \qquad \text{(definition of convolution sum)}
\end{aligned}
\tag{18.6}
\]

where $\star$ denotes convolution. Also, it is easily shown by direct computation that

\[
\begin{aligned}
g[k] &= \sum_{j=-1}^{0} h[-j]h[k-j] \\
&= h[-k] \star h[k]
\end{aligned}
\tag{18.7}
\]

and therefore from (18.6) and (18.7) we have the final result

\[
\begin{aligned}
r_X[k] &= (h[-k] \star h[k]) \star r_U[k] \\
&= h[-k] \star h[k] \star r_U[k].
\end{aligned}
\tag{18.8}
\]

The parentheses can be omitted in (18.8) since the order in which the convolutions are carried out is immaterial (due to the associative and commutative properties of convolution).
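The equivalence of (18.5) and (18.8) can be confirmed with a few lines of code. The sketch below is an illustration only: the filter taps in `h` and the input ACS in `r_u` are hypothetical values chosen for the check, and sequences are stored as Python dictionaries mapping the index k to the sample value.

```python
def convolve(a, b):
    """Convolution sum of two finite sequences stored as {index: value} dicts."""
    out = {}
    for i, ai in a.items():
        for j, bj in b.items():
            out[i + j] = out.get(i + j, 0.0) + ai * bj
    return out


def flip(a):
    """Time reversal: returns the sequence a[-k]."""
    return {-k: v for k, v in a.items()}


h = {0: 1.0, 1: -0.5}            # hypothetical h[0] and h[1]
r_u = {-1: 0.5, 0: 2.0, 1: 0.5}  # hypothetical symmetric input ACS

g = convolve(flip(h), h)         # g[k] = h[-k] * h[k], as in (18.7)
r_x = convolve(g, r_u)           # r_X[k] from (18.8)


def r_x_direct(k):
    """Output ACS written out term by term as in (18.5)."""
    def ru(m):
        return r_u.get(m, 0.0)
    return ((h[0] ** 2 + h[1] ** 2) * ru(k)
            + h[0] * h[1] * ru(k - 1)
            + h[1] * h[0] * ru(k + 1))


for k in range(-3, 4):
    print(k, r_x.get(k, 0.0), r_x_direct(k))  # the two columns should agree
```

Storing sequences as dictionaries keeps negative indices explicit, which makes the role of the time-reversed sequence h[-k] easy to see.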

To find the PSD of X[n] we note from (18.4) that the Fourier transform of the impulse response is the frequency response and therefore

\[
\begin{aligned}
\mathcal{F}\{h[k]\} &= H(f) \\
\mathcal{F}\{h[-k]\} &= H^*(f)
\end{aligned}
\]

where $\mathcal{F}$ indicates the discrete-time Fourier transform. Fourier transforming (18.8) produces

\[
P_X(f) = H^*(f)H(f)P_U(f)
\]

or finally we have

\[
P_X(f) = |H(f)|^2 P_U(f).
\]

This is the fundamental relationship for the PSD at the output of an LSI system: the output PSD is the input PSD multiplied by the magnitude-squared of the frequency response. We summarize the foregoing results in a theorem.


Theorem  18.3.1  (Random  Process  Characteristics  at  LSI  System  Output) 

If a WSS random process U[n] with mean $\mu_U$ and ACS $r_U[k]$ is input to an LSI system which has an impulse response h[k] and frequency response H(f), then the output random process $X[n] = \sum_{k=-\infty}^{\infty} h[k]U[n-k]$ is also WSS and

\[
\mu_X = \sum_{k=-\infty}^{\infty} h[k]\mu_U = H(0)\mu_U \tag{18.9}
\]

\[
r_X[k] = h[-k] \star h[k] \star r_U[k] \tag{18.10}
\]

\[
P_X(f) = |H(f)|^2 P_U(f). \tag{18.11}
\]
Proof: The mean sequence at the output is

\[
\begin{aligned}
\mu_X[n] = E[X[n]] &= E\left[\sum_{k=-\infty}^{\infty} h[k]U[n-k]\right] \\
&= \sum_{k=-\infty}^{\infty} h[k]E[U[n-k]] \\
&= \sum_{k=-\infty}^{\infty} h[k]\mu_U \qquad (U[n] \text{ is WSS})
\end{aligned}
\]

and is not dependent on n. To determine if an ACS can be defined, we consider E[X[n]X[n+k]]. This becomes

\[
\begin{aligned}
E[X[n]X[n+k]] &= E\left[\sum_{i=-\infty}^{\infty} h[i]U[n-i] \sum_{j=-\infty}^{\infty} h[j]U[n+k-j]\right] \\
&= \sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} h[i]h[j] \underbrace{E[U[n-i]U[n+k-j]]}_{r_U[k-j+i]}
\end{aligned}
\]

since U[n] was assumed to be WSS. It is seen that there is no dependence on n and hence X[n] is WSS. The ACS is

\[
\begin{aligned}
r_X[k] &= \sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} h[i]h[j]r_U[(k+i)-j] \\
&= \sum_{i=-\infty}^{\infty} h[i] \underbrace{\sum_{j=-\infty}^{\infty} h[j]r_U[(k+i)-j]}_{g[k+i]}
\end{aligned}
\]

where

\[
g[m] = h[m] \star r_U[m]. \tag{18.12}
\]

Now we have

\[
\begin{aligned}
r_X[k] &= \sum_{i=-\infty}^{\infty} h[i]g[k+i] \\
&= \sum_{l=-\infty}^{\infty} h[-l]g[k-l] \qquad (\text{let } l = -i) \\
&= h[-k] \star g[k].
\end{aligned}
\]

But from (18.12) $g[k] = h[k] \star r_U[k]$ and therefore

\[
\begin{aligned}
r_X[k] &= h[-k] \star (h[k] \star r_U[k]) \\
&= h[-k] \star h[k] \star r_U[k]
\end{aligned}
\tag{18.13}
\]

due to the associative and commutative properties of convolution. The last result of (18.11) follows by taking the Fourier transform of (18.13) and noting that $\mathcal{F}\{h[-k]\} = H^*(f)$.

△
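A numerical spot check of the theorem can be helpful. The following Python sketch is an illustration under assumed values: the three-tap filter `h` and the input ACS `r_u` are hypothetical. It forms $r_X[k]$ from (18.10), takes the discrete-time Fourier transform of the result, and compares it with $|H(f)|^2 P_U(f)$ from (18.11) on a grid of frequencies.

```python
import cmath
import math


def convolve(a, b):
    """Convolution sum of two finite sequences stored as {index: value} dicts."""
    out = {}
    for i, ai in a.items():
        for j, bj in b.items():
            out[i + j] = out.get(i + j, 0.0) + ai * bj
    return out


def flip(a):
    """Time reversal: returns the sequence a[-k]."""
    return {-k: v for k, v in a.items()}


def dtft(seq, f):
    """Discrete-time Fourier transform of a finite {index: value} sequence."""
    return sum(v * cmath.exp(-2j * math.pi * f * k) for k, v in seq.items())


h = {0: 1.0, 1: -0.5, 2: 0.25}   # hypothetical FIR impulse response
r_u = {-1: 0.5, 0: 2.0, 1: 0.5}  # hypothetical input ACS (zero mean input)

dc_gain = dtft(h, 0.0).real                # H(0), the factor in (18.9)
r_x = convolve(convolve(flip(h), h), r_u)  # (18.10)

max_err = 0.0
for i in range(11):
    f = -0.5 + 0.1 * i
    p_u = dtft(r_u, f).real                # input PSD
    p_x_from_acs = dtft(r_x, f).real       # Fourier transform of r_X[k]
    p_x_from_theorem = abs(dtft(h, f)) ** 2 * p_u  # (18.11)
    max_err = max(max_err, abs(p_x_from_acs - p_x_from_theorem))

print(dc_gain, max_err)
```

The maximum discrepancy should be at rounding level, and `dc_gain` equals the sum of the filter taps, in line with (18.9).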

A special case of particular interest occurs when the input to the system is white noise. Then using $P_U(f) = \sigma_U^2$ in (18.11), the output PSD becomes

\[
P_X(f) = |H(f)|^2 \sigma_U^2. \tag{18.14}
\]

Using $r_U[k] = \sigma_U^2\delta[k]$ in (18.10), the output ACS becomes

\[
r_X[k] = h[-k] \star h[k] \star \sigma_U^2\delta[k]
\]

and noting that $h[k] \star \delta[k] = h[k]$

\[
\begin{aligned}
r_X[k] &= \sigma_U^2\, h[-k] \star h[k] \\
&= \sigma_U^2 \sum_{i=-\infty}^{\infty} h[-i]h[k-i].
\end{aligned}
\]

Finally, letting m = -i we have the result

\[
r_X[k] = \sigma_U^2 \sum_{m=-\infty}^{\infty} h[m]h[m+k] \qquad -\infty < k < \infty. \tag{18.15}
\]


This  formula  is  useful  for  determining  the  output  ACS,  as  is  illustrated  next. 

Example  18.2  —  AR  random  process 

In  Examples  17.5  and  17.10  we  derived  the  ACS  and  PSD  for  an  AR  random 
process.  We  now  rederive  these  quantities  using  the  linear  systems  concepts  just 
described. Recall that an AR random process is defined as X[n] = aX[n-1] + U[n] and can be viewed as the output of an LSI filter with system function

\[
\mathcal{H}(z) = \frac{1}{1 - az^{-1}}
\]

with white Gaussian noise U[n] at the input. This is shown in Figure 18.2 and follows from the definition of the system function $\mathcal{H}(z)$ as the z-transform of the output sequence divided by the z-transform of the input sequence. To see this let u[n] be a deterministic input sequence with z-transform U(z) and x[n] be the corresponding deterministic output sequence with z-transform X(z). Then we have by the definition of the system function

\[
\mathcal{H}(z) = \frac{X(z)}{U(z)}
\]

and therefore for the given system function

\[
X(z) = \mathcal{H}(z)U(z) = \frac{U(z)}{1 - az^{-1}}.
\]

Thus,

\[
X(z) - az^{-1}X(z) = U(z)
\]

and taking the inverse z-transform yields the recursive difference equation

\[
x[n] - ax[n-1] = u[n] \tag{18.16}
\]

which is equivalent to our AR random process definition when the input and output sequences are replaced by random processes.

[Block diagram: LSI filter with system function $\mathcal{H}(z) = 1/(1 - az^{-1})$]

Figure 18.2: Linear system model for AR random process. The input random process U[n] is white Gaussian noise with variance $\sigma_U^2$.

The output PSD is now found by using (18.14) to yield

\[
\begin{aligned}
P_X(f) &= |\mathcal{H}(\exp(j2\pi f))|^2 \sigma_U^2 \\
&= \frac{\sigma_U^2}{|1 - a\exp(-j2\pi f)|^2}
\end{aligned}
\tag{18.17}
\]

which agrees with our previous results. To determine the ACS we can either take the inverse Fourier transform of (18.17) or use (18.15). The latter approach is generally easier. To find the impulse response we can use (18.16) with the input set to $\delta[n]$ so that the output is by definition h[n]. Since the LSI system is assumed to be causal, we need to determine the solution of the difference equation $h[n] = ah[n-1] + \delta[n]$ for $n \ge 0$ with initial condition h[-1] = 0. The reason that the initial condition is set equal to zero is our assumption that the LSI system is causal. A causal system cannot produce an output which is nonzero, in this case h[-1], before the input is applied, in this case at n = 0 since the input is $\delta[n]$. This produces $h[n] = a^n u_s[n]$, where we now use $u_s[n]$ to denote the unit step in order to avoid confusion with the random process realization u[n] (see Appendix D.3). Thus, (18.15) becomes for $k \ge 0$

\[
\begin{aligned}
r_X[k] &= \sigma_U^2 \sum_{m=-\infty}^{\infty} a^m u_s[m]\, a^{m+k} u_s[m+k] \\
&= \sigma_U^2 a^k \sum_{m=0}^{\infty} a^{2m} \qquad (m \ge 0 \text{ and } m + k \ge 0 \text{ for nonzero term in sum}) \\
&= \sigma_U^2 \frac{a^k}{1 - a^2} \qquad (\text{since } |a| < 1)
\end{aligned}
\]

and therefore for all k

\[
r_X[k] = \sigma_U^2 \frac{a^{|k|}}{1 - a^2}.
\]

Again the ACS is in agreement with our previous results. Note that the linear system shown in Figure 18.2 is called an infinite impulse response (IIR) filter. This is because the impulse response $h[n] = a^n u_s[n]$ is infinite in length.
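The closed-form AR ACS can be compared against a direct evaluation of (18.15). The Python sketch below is illustrative only: it generates the impulse response by running the difference equation with h[-1] = 0, truncates the infinite sum at an assumed length of 400 samples (adequate here because $a^n$ decays quickly for a = 0.9), and uses the assumed values a = 0.9 and $\sigma_U^2 = 0.5$.

```python
def ar_impulse_response(a, length):
    """Run h[n] = a*h[n-1] + delta[n] for n = 0..length-1 with h[-1] = 0."""
    h = []
    prev = 0.0  # h[-1] = 0 because the system is causal
    for n in range(length):
        current = a * prev + (1.0 if n == 0 else 0.0)
        h.append(current)
        prev = current
    return h


def acs_white_input(h, var_u, k):
    """Truncated version of (18.15) for a causal impulse response h.

    Uses |k| because the sum in (18.15) is unchanged when k is replaced by -k.
    """
    k = abs(k)
    return var_u * sum(h[m] * h[m + k] for m in range(len(h) - k))


a, var_u = 0.9, 0.5               # illustrative parameter values (assumption)
h = ar_impulse_response(a, 400)   # truncation length is an assumption

for k in range(-3, 4):
    numerical = acs_white_input(h, var_u, k)
    closed_form = var_u * a ** abs(k) / (1 - a ** 2)
    print(k, numerical, closed_form)  # the two columns should agree closely
```

The recursion reproduces $h[n] = a^n$ for $n \ge 0$, so the agreement of the two columns is a check on both the impulse response and the geometric-series step.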

❖


Fourier and z-transforms of WSS random processes don't exist.


To determine the system function in the previous example we assumed the input to the linear system was a deterministic sequence u[n]. The corresponding output x[n], therefore, was also a deterministic sequence. This is because formally the z-transform (and also the Fourier transform) cannot exist for a WSS random process. Existence requires the sequence to decay to zero as time becomes large. But of course if the random process is WSS, then we know that $E[X^2[n]]$ is constant as $n \to \pm\infty$ and so we cannot have $|X[n]| \to 0$ as $n \to \pm\infty$.


Example  18.3  —  MA  random  process 

In Example 17.3 we derived the ACS for an MA random process. We now show how to accomplish this more easily using (18.15). Recall the definition of the MA random process in Example 17.3 as X[n] = (U[n] + U[n-1])/2, with U[n] being white Gaussian noise. This may be interpreted as the output of an LSI filter with white Gaussian noise at the input. In fact, it should now be obvious that the system function is $\mathcal{H}(z) = 1/2 + (1/2)z^{-1}$ and therefore the impulse response is h[m] = 1/2 for m = 0, 1 and zero otherwise. Using (18.15) we have

\[
\begin{aligned}
r_X[k] &= \sigma_U^2 \sum_{m=-\infty}^{\infty} h[m]h[m+k] \\
&= \sigma_U^2 \sum_{m=0}^{1} h[m]h[m+k]
\end{aligned}
\]

and so for $k \ge 0$

\[
r_X[k] = \begin{cases} \sigma_U^2 \sum_{m=0}^{1} h^2[m] & k = 0 \\ \sigma_U^2 \sum_{m=0}^{0} h[m]h[m+1] & k = 1 \\ 0 & k \ge 2. \end{cases}
\]

Finally, we have

\[
r_X[k] = \begin{cases} \sigma_U^2\left[\left(\frac{1}{2}\right)^2 + \left(\frac{1}{2}\right)^2\right] = \sigma_U^2/2 & k = 0 \\ \sigma_U^2\left(\frac{1}{2}\right)\left(\frac{1}{2}\right) = \sigma_U^2/4 & k = 1 \\ 0 & k \ge 2 \end{cases}
\]

which is the same as previously obtained.
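The MA result can also be checked by simulation. The following Python sketch is an illustration, not taken from the text: it generates a realization of X[n] = (U[n] + U[n-1])/2 with white Gaussian U[n] and estimates the ACS by a sample average. The sample size, the seed, and the choice $\sigma_U^2 = 1$ are assumptions, and the estimates only approximate the theoretical values.

```python
import random


def simulate_ma(num_samples, var_u, seed=0):
    """Realization of X[n] = (U[n] + U[n-1])/2 with white Gaussian U[n]."""
    rng = random.Random(seed)
    std_u = var_u ** 0.5
    u_prev = rng.gauss(0.0, std_u)
    x = []
    for _ in range(num_samples):
        u_now = rng.gauss(0.0, std_u)
        x.append(0.5 * (u_now + u_prev))
        u_prev = u_now
    return x


def estimate_acs(x, k):
    """Sample average of x[n] x[n+k] for k >= 0 (zero mean process assumed)."""
    n_terms = len(x) - k
    return sum(x[n] * x[n + k] for n in range(n_terms)) / n_terms


var_u = 1.0                          # illustrative variance (assumption)
x = simulate_ma(100_000, var_u)
estimates = [estimate_acs(x, k) for k in range(4)]
theory = [var_u / 2, var_u / 4, 0.0, 0.0]

for k in range(4):
    print(k, estimates[k], theory[k])  # estimate versus sigma_U^2/2, sigma_U^2/4, 0, 0
```

Because only the first two moments matter here, replacing `rng.gauss` with any other zero mean white noise generator of the same variance should give estimates near the same theoretical values.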


18.4  Interpretation  of  the  PSD 


We are now in a position to prove that the PSD, when integrated over a band of frequencies, yields the average power within that band. In doing so, the PSD may then be interpreted as the average power per unit frequency. We next consider a method to measure the average power of a WSS random process within a very narrow band of frequencies. To do so we filter the random process with an ideal narrowband filter whose frequency response is

\[
H(f) = \begin{cases} 1 & -f_0 - \frac{\Delta f}{2} \le f \le -f_0 + \frac{\Delta f}{2}, \quad f_0 - \frac{\Delta f}{2} \le f \le f_0 + \frac{\Delta f}{2} \\ 0 & \text{otherwise} \end{cases}
\]

and which is shown in Figure 18.3a. The width of the passband of the filter $\Delta f$ is assumed to be very small. If a WSS random process X[n] is input to this filter, then the output WSS random process Y[n] will be composed of frequency components within the $\Delta f$ frequency band, the remaining ones having been "filtered out". The total average power in the output random process Y[n] (which is WSS by Theorem 18.3.1) is $r_Y[0]$ and represents the sum of the average powers in X[n] within the bands $[-f_0 - \Delta f/2, -f_0 + \Delta f/2]$ and $[f_0 - \Delta f/2, f_0 + \Delta f/2]$. It can be found from

\[
r_Y[0] = \int_{-1/2}^{1/2} P_Y(f)\,df \qquad (\text{from (17.38)}).
\]

Figure 18.3: Narrowband filtering of random process to measure power within a band of frequencies.


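Now using (18.11) and the definition of the narrowband filter frequency response we have

\[
\begin{aligned}
r_Y[0] &= \int_{-1/2}^{1/2} P_Y(f)\,df \\
&= \int_{-1/2}^{1/2} |H(f)|^2 P_X(f)\,df \qquad (\text{from (18.11)}) \\
&= \int_{-f_0 - \Delta f/2}^{-f_0 + \Delta f/2} 1 \cdot P_X(f)\,df + \int_{f_0 - \Delta f/2}^{f_0 + \Delta f/2} 1 \cdot P_X(f)\,df \\
&= 2\int_{f_0 - \Delta f/2}^{f_0 + \Delta f/2} 1 \cdot P_X(f)\,df \qquad (\text{since } P_X(-f) = P_X(f)).
\end{aligned}
\]

If we let $\Delta f \to 0$, so that $P_X(f) \to P_X(f_0)$ within the integration interval, this becomes approximately

\[
r_Y[0] = 2P_X(f_0)\Delta f
\]

or

\[
P_X(f_0) = \frac{1}{2}\frac{r_Y[0]}{\Delta f}.
\]

Since $r_Y[0]$ is the total average power due to the frequency components within the bands shown in Figure 18.3a, which is twice the total average power in the positive frequency band, we have that

\[
P_X(f_0) = \frac{\text{Total average power in band } [f_0 - \Delta f/2,\ f_0 + \Delta f/2]}{\Delta f}. \tag{18.18}
\]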
This says that the PSD $P_X(f_0)$ is the average power of X[n] in a small band of frequencies about $f = f_0$ divided by the width of the band. It justifies the name of power spectral density. Furthermore, to obtain the average power within a frequency band from knowledge of the PSD, we can reverse (18.18) to obtain

\[
\text{Total average power in band } [f_0 - \Delta f/2,\ f_0 + \Delta f/2] = P_X(f_0)\Delta f
\]

which is the area under the PSD curve. More generally, we have for an arbitrary frequency band

\[
\text{Total average power in band } [f_1, f_2] = \int_{f_1}^{f_2} P_X(f)\,df
\]

which was previously asserted.
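The interpretation of the PSD as power per unit frequency can be illustrated numerically. The Python sketch below is an assumption-laden illustration: it uses the AR PSD of (18.17) with a = 0.9 and $\sigma_U^2 = 0.5$, a simple midpoint rule for the integrals, and a band of width $\Delta f = 0.001$ centered at $f_0 = 0.1$; the helper names `ar_psd` and `band_power` are hypothetical.

```python
import cmath
import math


def ar_psd(f, a, var_u):
    """AR PSD from (18.17): var_u / |1 - a exp(-j 2 pi f)|^2."""
    return var_u / abs(1 - a * cmath.exp(-2j * math.pi * f)) ** 2


def band_power(psd, f1, f2, num_steps=20000):
    """Midpoint-rule approximation to the integral of psd(f) over [f1, f2]."""
    df = (f2 - f1) / num_steps
    return sum(psd(f1 + (i + 0.5) * df) for i in range(num_steps)) * df


a, var_u = 0.9, 0.5                      # illustrative values (assumption)


def psd(f):
    return ar_psd(f, a, var_u)


total_power = band_power(psd, -0.5, 0.5)  # should approximate r_X[0]
f0, delta_f = 0.1, 1e-3                   # narrow band (assumption)
narrow_power = band_power(psd, f0 - delta_f / 2, f0 + delta_f / 2, 100)

print(total_power, var_u / (1 - a ** 2))  # total power versus r_X[0]
print(narrow_power / delta_f, psd(f0))    # power per unit frequency versus PSD
```

Dividing the narrow-band power by $\Delta f$ recovers the PSD value at $f_0$ to good accuracy, which is the content of (18.18).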

18.5  Wiener  Filtering 

Armed  with  the  knowledge  of  the  mean  and  ACS  or  equivalently  the  mean  and 
PSD  of  a  WSS  random  process,  there  are  several  important  problems  that  can  be 
solved.  Because  the  required  knowledge  consists  of  only  the  first  two  moments  of 
the  random  process  (which  in  practice  can  be  estimated),  the  solutions  to  these 
problems  have  found  widespread  application.  The  generic  approach  that  results  is 
termed Wiener filtering, although there are actually four slightly different problems and corresponding solutions. These problems are illustrated in Figure 18.4 and are referred to as filtering, smoothing, prediction, and interpolation [Wiener 1949]. In the filtering problem (see Figure 18.4a) it is assumed that a signal S[n] has been corrupted by additive noise W[n] so that the observed random process is X[n] = S[n] + W[n]. It is desired to estimate S[n] by filtering X[n] with an LSI filter having an impulse response h[k]. The filter will hopefully reduce the noise but pass the signal. The filter estimates a particular sample of the signal, say $S[n_0]$, by processing the current data sample $X[n_0]$ and the past data samples $\{X[n_0 - 1], X[n_0 - 2], \ldots\}$. Hence, the filter is assumed to be causal with an impulse response h[k] = 0 for k < 0. This produces the estimator

\[
\hat{S}[n_0] = \sum_{k=0}^{\infty} h[k]X[n_0 - k] \tag{18.19}
\]

which depends on the current sample, containing the signal sample of interest, and past observed data samples. Presumably, the past signal samples are correlated with the present signal sample and hence the use of past samples of X[n] should enhance the estimation performance. This type of processing is called filtering and can be implemented in real time.
[Figure 18.4 panels: (a) Filtering (true signal shown dashed and displaced to right); (b) Smoothing (true signal shown dashed and displaced to right)]

Figure 18.4: Definition of Wiener "filtering" problems.


What  are  we  really  estimating  here? 


In Section 7.9 we attempted to estimate the outcome of a random variable, which was unobserved, based on the outcome of another random variable, which was observed. The correlation between the two random variables allowed us to do this. Here we have essentially the same problem, except that the outcome of interest to us is of the random variable $S[n_0]$. The random variables that are observed are $\{X[n_0], X[n_0 - 1], \ldots\}$ or we have access to the realization (another name for outcome) $\{x[n_0], x[n_0 - 1], \ldots\}$. Thus, we are attempting to estimate the realization of $S[n_0]$ based on the realization $\{x[n_0], x[n_0 - 1], \ldots\}$. This should be kept in mind since our notation of $\hat{S}[n_0] = \sum_{k=0}^{\infty} h[k]X[n_0 - k]$ seems to indicate that we are attempting to estimate a random variable $S[n_0]$ based on other random variables $\{X[n_0], X[n_0 - 1], \ldots\}$. What we are actually trying to accomplish is a procedure of estimating a realization of a random variable based on realizations of other random variables that will work for all realizations. Hence, we employ the capital letter notation for random variables to indicate our interest in all realizations and to allow us to employ expectation operations on the random variables.

△

The second problem is called smoothing (see Figure 18.4b). It differs from filtering in that the filter is not constrained to be causal. Therefore, the estimator becomes

\[
\hat{S}[n_0] = \sum_{k=-\infty}^{\infty} h[k]X[n_0 - k] \tag{18.20}
\]

where $\hat{S}[n_0]$ now depends on present, past, and future samples of X[n]. Clearly, this is not realizable in real time but can be approximated if we allow a delay before determining the estimate. The delay is necessary to accumulate the samples $\{X[n_0 + 1], X[n_0 + 2], \ldots\}$ before computing $\hat{S}[n_0]$. Within a digital computer we would store these "future" samples.

For problems three and four we observe samples of the WSS random process X[n] and wish to estimate an unobserved sample. For prediction, which is also called extrapolation and forecasting, we observe the current and past samples $\{X[n_0], X[n_0 - 1], \ldots\}$ and wish to estimate a future sample, $X[n_0 + L]$, for some positive integer L. The prediction is called an L-step prediction. We will only consider one-step prediction or L = 1 (see Figure 18.4c). The reader should see [Yaglom 1962] for the more general case and also Problem 18.26 for an example. The predictor then becomes

\[
\hat{X}[n_0 + 1] = \sum_{k=0}^{\infty} h[k]X[n_0 - k] \tag{18.21}
\]

which of course uses a causal filter. For interpolation (see Figure 18.4d) we observe samples $\{\ldots, X[n_0 - 1], X[n_0 + 1], \ldots\}$ and wish to estimate $X[n_0]$. The interpolator then becomes

\[
\hat{X}[n_0] = \sum_{\substack{k=-\infty \\ k \ne 0}}^{\infty} h[k]X[n_0 - k] \tag{18.22}
\]

which is a noncausal filter. For practical implementation of (18.19)-(18.22) we must truncate the impulse responses to some finite number of samples.

To determine the optimal filter impulse responses we adopt the mean square error (MSE) criterion. Estimators that consist of LSI filters whose impulse responses are chosen to minimize an MSE are generically referred to as Wiener filters [Wiener 1949]. Of the four problems mentioned, we will solve the smoothing and prediction problems. The solution for the filtering problem can be found in [Orfanidis 1985] while that for the interpolation problem is described in [Yaglom 1962] (see also Problem 18.27).
18.5.1  Wiener  Smoothing

We observe X[n] = S[n] + W[n] for $-\infty < n < \infty$ and wish to estimate $S[n_0]$ using (18.20). It is assumed that S[n] and W[n] are both zero mean WSS random processes with known ACSs (PSDs). Also, since there is usually no reason to assume otherwise, we assume that the signal and noise random processes are uncorrelated. This means that any sample of S[n] is uncorrelated with any sample of W[n] or $E[S[n_1]W[n_2]] = 0$ for all $n_1$ and $n_2$. The MSE for this problem is defined as

\[
\text{mse} = E[\epsilon^2[n_0]] = E[(S[n_0] - \hat{S}[n_0])^2]
\]

where $\epsilon[n_0] = S[n_0] - \hat{S}[n_0]$ is the error. To minimize the MSE we utilize the orthogonality principle described in Section 14.7 which states that the error should be orthogonal, i.e., uncorrelated, with the data. Since the data consists of X[n] for all n, the orthogonality principle produces the requirement

\[
E[\epsilon[n_0]X[n_0 - l]] = 0 \qquad -\infty < l < \infty.
\]


Thus, we have that

\[
\begin{aligned}
E[(S[n_0] - \hat{S}[n_0])X[n_0 - l]] &= 0 \\
E\left[\left(S[n_0] - \sum_{k=-\infty}^{\infty} h[k]X[n_0 - k]\right)X[n_0 - l]\right] &= 0 \qquad (\text{from (18.20)})
\end{aligned}
\]

which results in

\[
E[S[n_0]X[n_0 - l]] = \sum_{k=-\infty}^{\infty} h[k]E[X[n_0 - k]X[n_0 - l]]. \tag{18.23}
\]

But

\[
\begin{aligned}
E[S[n_0]X[n_0 - l]] &= E[S[n_0](S[n_0 - l] + W[n_0 - l])] \\
&= E[S[n_0]S[n_0 - l]] \qquad (S[n] \text{ and } W[n] \text{ are uncorrelated and zero mean}) \\
&= r_S[l]
\end{aligned}
\]

and

\[
\begin{aligned}
E[X[n_0 - k]X[n_0 - l]] &= E[(S[n_0 - k] + W[n_0 - k])(S[n_0 - l] + W[n_0 - l])] \\
&= E[S[n_0 - k]S[n_0 - l]] + E[W[n_0 - k]W[n_0 - l]] \\
&= r_S[l - k] + r_W[l - k].
\end{aligned}
\]

The infinite set of simultaneous linear equations becomes from (18.23)

\[
r_S[l] = \sum_{k=-\infty}^{\infty} h[k](r_S[l - k] + r_W[l - k]) \qquad -\infty < l < \infty. \tag{18.24}
\]
Note that the equations do not depend on $n_0$ and therefore the solution for the optimal impulse response is the same for any $n_0$. This is due to the WSS assumption coupled with the LSI assumption for the estimator, which together imply that a shift in the sample to be estimated results in the same filtering operation but shifted. To solve this set of equations we can use transform techniques since the right-hand side of (18.24) is seen to be a discrete-time convolution. It follows then that

\[
r_S[l] = h[l] \star (r_S[l] + r_W[l])
\]

and taking Fourier transforms of both sides yields

\[
P_S(f) = H(f)(P_S(f) + P_W(f))
\]

or finally the frequency response of the optimal Wiener smoothing filter is

\[
H_{\text{opt}}(f) = \frac{P_S(f)}{P_S(f) + P_W(f)}. \tag{18.25}
\]

The optimal impulse response can be found by taking the inverse Fourier transform of (18.25). We next give an example.

Example  18.4  -  Wiener  smoother  for  AR  signal  in  white  noise 

Consider a signal that is an AR random process corrupted by additive white noise with variance $\sigma_W^2$. Then, the PSDs are

\[
\begin{aligned}
P_S(f) &= \frac{\sigma_U^2}{|1 - a\exp(-j2\pi f)|^2} \\
P_W(f) &= \sigma_W^2.
\end{aligned}
\]

The PSDs and corresponding Wiener smoother frequency responses are shown in Figure 18.5. In both cases the white noise variance is the same, $\sigma_W^2 = 1$, and the AR input noise variance is the same, $\sigma_U^2 = 0.5$, but the AR filter parameter a has been chosen to yield a wide PSD and a very narrow PSD. As an example, consider the case of a = 0.9, which results in a lowpass signal random process as shown in Figure 18.5b. Then, the results of a computer simulation are shown in Figure 18.6. In Figure 18.6a the signal realization s[n] is shown as the dashed curve and the noise corrupted signal realization x[n] is shown as the solid curve. The points have been connected by straight lines for easier viewing. Applying the Wiener smoother results in the estimated signal shown in Figure 18.6b as the solid curve. Once again the true signal realization is shown as dashed. Note that the estimated signal shown in Figure 18.6b exhibits less noise fluctuations but, having been smoothed, also exhibits a reduced ability to follow the signal when the signal changes rapidly (see the estimated signal from n = 25 to n = 35). This is a standard tradeoff in that noise smoothing is obtained at the price of poorer signal following dynamics.
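To get a feel for (18.25) in this example, the smoother frequency response can be evaluated at a few frequencies. The Python sketch below is an illustration only; the function name `wiener_smoother_response` is hypothetical, and the parameter values are those quoted in the example (a = 0.2 or 0.9, $\sigma_U^2 = 0.5$, $\sigma_W^2 = 1$).

```python
import cmath
import math


def wiener_smoother_response(f, a, var_u, var_w):
    """H_opt(f) = P_S(f) / (P_S(f) + P_W(f)) for an AR signal in white noise."""
    p_s = var_u / abs(1 - a * cmath.exp(-2j * math.pi * f)) ** 2  # signal PSD
    return p_s / (p_s + var_w)                                    # (18.25)


var_u, var_w = 0.5, 1.0
for a in (0.2, 0.9):
    gains = [wiener_smoother_response(f, a, var_u, var_w)
             for f in (0.0, 0.1, 0.25, 0.5)]
    print(a, gains)  # gains at f = 0, 0.1, 0.25, 0.5

h_dc = wiener_smoother_response(0.0, 0.9, var_u, var_w)   # equals 50/51
h_half = wiener_smoother_response(0.5, 0.9, var_u, var_w)
```

For a = 0.9 the gain is close to one near f = 0, where the signal PSD dominates the noise, and small near f = 0.5, where the noise dominates; the response is real and lies between 0 and 1 at every frequency.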

❖
[Figure 18.5 panels: (a) PSDs for a = 0.2; (b) PSDs for a = 0.9; (c) smoother frequency response for a = 0.2; (d) smoother frequency response for a = 0.9. Horizontal axis: f from -0.5 to 0.5.]

Figure 18.5: Power spectral densities of the signal and noise and corresponding frequency responses of Wiener smoother.


[Figure 18.6 panels: (a) True (dashed) and noisy (solid) signal; (b) True (dashed) and estimated (solid) signal]

Figure 18.6: Example of Wiener smoother for additive noise corrupted AR signal. The true PSDs are shown in Figure 18.5b. In a) the true signal is shown as the dashed curve and the noisy signal as the solid curve and in b) the true signal is shown as the dashed curve and the Wiener smoothed signal estimate (using the Wiener smoother shown in Figure 18.5d) as the solid curve.

In order to implement the Wiener smoother for the previous example the data was filtered in the frequency domain and converted back into the time domain. This was done using the inverse discrete-time Fourier transform

\[
\hat{s}[n] = \int_{-1/2}^{1/2} \frac{P_S(f)}{P_S(f) + \sigma_W^2} X_N(f)\exp(j2\pi f n)\,df \qquad n = 0, 1, \ldots, N - 1
\]

where $X_N(f)$ is the Fourier transform of the available data $\{x[0], x[1], \ldots, x[N-1]\}$, which is

\[
X_N(f) = \sum_{n=0}^{N-1} x[n]\exp(-j2\pi f n)
\]

(N = 50 for the previous example). The actual implementation used an inverse FFT to approximate the integral as is shown in the MATLAB code given next. Note that in using the FFT and inverse FFT to calculate the Fourier transform and inverse Fourier transform, respectively, the frequency interval has been changed to [0,1]. Because the Fourier transform is periodic with period one, however, this will not affect the result.


clear all
randn('state',0)
a=0.9;varu=0.5;vars=varu/(1-a^2);varw=1;N=50; % set up parameters
for n=0:N-1 % generate signal realization
   nn=n+1;
   if n==0 % use Gaussian random processes
      s(nn,1)=sqrt(vars)*randn(1,1); % initialize first sample
                                     % to avoid transient
   else
      s(nn,1)=a*s(nn-1)+sqrt(varu)*randn(1,1);
   end
end
x=s+sqrt(varw)*randn(N,1); % add white Gaussian noise
Nfft=1024; % set up FFT length
% compute PSD of signal, frequency interval is [0,1]
Ps=varu./(abs(1-a*exp(-j*2*pi*[0:Nfft-1]'/Nfft)).^2);
Hf=Ps./(Ps+varw); % form Wiener smoother
sestf=Hf.*fft(x,Nfft); % signal estimate in frequency domain,
                       % frequency interval is [0,1]
sest=real(ifft(sestf,Nfft)); % inverse Fourier transform


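One can also determine the minimum MSE to assess how well the smoother performs. This is

\[
\begin{aligned}
\text{mse}_{\min} &= E[(S[n_0] - \hat{S}[n_0])^2] \\
&= E[(S[n_0] - \hat{S}[n_0])S[n_0]] - E[(S[n_0] - \hat{S}[n_0])\hat{S}[n_0]].
\end{aligned}
\]

But the second term is zero since by the orthogonality principle

\[
\begin{aligned}
E[(S[n_0] - \hat{S}[n_0])\hat{S}[n_0]] &= E\left[\epsilon[n_0]\sum_{k=-\infty}^{\infty} h_{\text{opt}}[k]X[n_0 - k]\right] \\
&= \sum_{k=-\infty}^{\infty} h_{\text{opt}}[k]\underbrace{E[\epsilon[n_0]X[n_0 - k]]}_{=0} = 0.
\end{aligned}
\]

Thus, we have

\[
\begin{aligned}
\text{mse}_{\min} &= E[(S[n_0] - \hat{S}[n_0])S[n_0]] \\
&= r_S[0] - E\left[\sum_{k=-\infty}^{\infty} h_{\text{opt}}[k]X[n_0 - k]S[n_0]\right] \\
&= r_S[0] - \sum_{k=-\infty}^{\infty} h_{\text{opt}}[k]\underbrace{E[(S[n_0 - k] + W[n_0 - k])S[n_0]]}_{=E[S[n_0 - k]S[n_0]] = r_S[k]}
\end{aligned}
\]

since $S[n_1]$ and $W[n_2]$ are uncorrelated for all $n_1$ and $n_2$ and also are zero mean. The minimum MSE becomes

\[
\text{mse}_{\min} = r_S[0] - \sum_{k=-\infty}^{\infty} h_{\text{opt}}[k]r_S[k]. \tag{18.26}
\]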
This can also be written in the frequency domain by using Parseval’s theorem to yield

\[
\begin{aligned}
\mathrm{mse}_{\min} &= \int_{-1/2}^{1/2} P_S(f)\,df - \int_{-1/2}^{1/2} H_{\rm opt}(f)P_S(f)\,df \qquad \text{((17.38) and Parseval)} \\
&= \int_{-1/2}^{1/2} (1-H_{\rm opt}(f))P_S(f)\,df \\
&= \int_{-1/2}^{1/2} \left(1-\frac{P_S(f)}{P_S(f)+P_W(f)}\right)P_S(f)\,df \\
&= \int_{-1/2}^{1/2} \frac{P_W(f)}{P_S(f)+P_W(f)}\,P_S(f)\,df
\end{aligned}
\]

and finally letting $\rho(f) = P_S(f)/P_W(f)$ be the signal-to-noise ratio in the frequency domain we have

\[
\mathrm{mse}_{\min} = \int_{-1/2}^{1/2} \frac{P_S(f)}{1+\rho(f)}\,df. \tag{18.27}
\]

It is seen that the frequency bands for which the contribution to the minimum MSE is largest are the bands for which the signal-to-noise ratio is smallest or for which $\rho(f) \ll 1$.
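As a rough numerical check of (18.27), the following sketch approximates the integral by a sum for the AR signal used in the smoother listing of this section (a = 0.9, signal excitation variance 0.5, noise variance 1). The grid size Nf and the variable names are assumptions made only for illustration.

```matlab
a=0.9; varu=0.5; varw=1;              % parameters of the smoother example
Nf=4096;                              % assumed number of frequency points
f=((0:Nf-1)'+0.5)/Nf-0.5;             % midpoints covering [-1/2,1/2]
Ps=varu./abs(1-a*exp(-1j*2*pi*f)).^2; % signal PSD
rho=Ps/varw;                          % signal-to-noise ratio rho(f)
msemin=sum(Ps./(1+rho))/Nf;           % sum approximation to (18.27)
rs0=sum(Ps)/Nf;                       % sum approximation to rS[0]
```

For these values msemin comes out near 0.35, which is below both the signal power rs0 and the noise variance varw, as one would hope for a useful smoother.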


18.5.2  Prediction 


We  consider  only  the  case  of  L  =  1  or  one-step  prediction.  The  more  general  case 
can  be  found  in  [Yaglom  1962]  (see  also  Problem  18.26).  As  before,  the  criterion  of 
MSE  is  used  to  design  the  predictor  so  that  from  (18.21) 


\[
\begin{aligned}
\mathrm{mse} &= E[(X[n_0+1]-\hat{X}[n_0+1])^2] \\
&= E\left[\left(X[n_0+1]-\sum_{k=0}^{\infty} h[k]X[n_0-k]\right)^2\right]
\end{aligned}
\]

is to be minimized over $h[k]$ for $k \ge 0$. Invoking the orthogonality principle leads to the infinite set of simultaneous linear equations

\[
E\left[\left(X[n_0+1]-\sum_{k=0}^{\infty} h[k]X[n_0-k]\right)X[n_0-l]\right] = 0 \qquad l = 0,1,\ldots.
\]

These equations become

\[
E[X[n_0+1]X[n_0-l]] = \sum_{k=0}^{\infty} h[k]E[X[n_0-k]X[n_0-l]]
\]

or finally

\[
r_X[l+1] = \sum_{k=0}^{\infty} h[k]r_X[l-k] \qquad l = 0,1,\ldots. \tag{18.28}
\]

Note that once again the optimal impulse response does not depend upon $n_0$ so that we obtain the same predictor for any sample. Although it appears that we should be able to solve these simultaneous linear equations using the previous Fourier transform approach, this is not so. Because the equations are only valid for $l \ge 0$ and not for $l < 0$, a z-transform cannot be used. Consider forming the z-transform of the left-hand side as $\sum_{l=0}^{\infty} r_X[l+1]z^{-l}$ and note that it is not equal to $z\mathcal{P}_X(z)$. (See also Problem 18.15 to see what would happen if we blindly went ahead with this approach.)

The  minimum  MSE  is  evaluated  by  using  a  similar  argument  as  for  the  Wiener 
smoother 


\[
\begin{aligned}
\mathrm{mse}_{\min} &= E\left[\left(X[n_0+1]-\sum_{k=0}^{\infty} h_{\rm opt}[k]X[n_0-k]\right)X[n_0+1]\right] \\
&= r_X[0] - \sum_{k=0}^{\infty} h_{\rm opt}[k]r_X[k+1]
\end{aligned} \tag{18.29}
\]

where $h_{\rm opt}[k]$ is the impulse response solution from (18.28). A simple example for which the equations of (18.28) can be solved is given next.

Example  18.5  —  Prediction  of  AR  random  process 

Consider an AR random process for which the ACS is given by $r_X[k] = (\sigma_U^2/(1-a^2))a^{|k|} = r_X[0]a^{|k|}$. Then from (18.28)

\[
r_X[0]a^{|l+1|} = \sum_{k=0}^{\infty} h[k]r_X[0]a^{|l-k|} \qquad l = 0,1,\ldots
\]

and if we let $h[k] = 0$ for $k \ge 1$, we have

\[
a^{|l+1|} = h[0]a^{|l|} \qquad l = 0,1,\ldots.
\]

Since $l \ge 0$, the solution is easily seen to be

\[
h_{\rm opt}[0] = \frac{a^{l+1}}{a^{l}} = a
\]

or finally

\[
\hat{X}[n_0+1] = aX[n_0].
\]

Also, since this is true for any $n_0$, we can replace the specific sample by a more general sample by replacing $n_0$ by $n-1$. This results in

\[
\hat{X}[n] = aX[n-1]. \tag{18.30}
\]

Recalling that the AR random process is defined as $X[n] = aX[n-1] + U[n]$, it is now seen that the optimal one-step linear predictor is obtained from the definition by ignoring the term $U[n]$. This is because $U[n]$ cannot be predicted from the past samples $\{X[n-1], X[n-2], \ldots\}$, which are uncorrelated with $U[n]$ (see also Example 17.5). Furthermore, the prediction error is $\epsilon[n] = X[n] - \hat{X}[n] = X[n] - aX[n-1] = U[n]$. Finally, note that the prediction only depends on the most recent sample and not on the earlier samples of $X[n]$. In effect, to predict $X[n_0+1]$ all the past information of the random process is embodied in the sample $X[n_0]$. To illustrate the prediction solution consider the AR random process whose parameters and realizations were shown in Figure 17.5. The realizations, along with the one-step predictions, shown as the “*”s, are given in Figure 18.7. Note the good predictions


(a) $a = 0.25$, $\sigma_U^2 = 1-a^2$    (b) $a = 0.98$, $\sigma_U^2 = 1-a^2$

Figure 18.7: Typical realizations of autoregressive random process with different parameters and their one-step linear predictions indicated by the “*”s as $\hat{X}[n+1] = ax[n]$. (Horizontal axis: $n$.)

for  the  AR  random  process  with  a  =  0.98  but  the  relatively  poor  ones  for  the  AR 
random  process  with  a  =  0.25.  Can  you  justify  these  results  by  comparing  the 
minimum  MSEs?  (See  Problem  18.17.) 

❖ 
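The comparison suggested in Example 18.5 can be sketched by simulation. The code below generates the two AR processes of Figure 18.7, forms the one-step predictions of (18.30), and estimates the prediction MSE in each case. The number of samples N and all variable names are assumptions chosen for illustration; they are not taken from the figure.

```matlab
randn('state',0)                       % seed, as in the chapter's listings
N=20000;                               % assumed number of samples
avals=[0.25 0.98];                     % the two cases of Figure 18.7
msehat=zeros(2,1); msetheory=zeros(2,1);
for i=1:2
   a=avals(i); varu=1-a^2;             % chosen so that rX[0]=1
   u=sqrt(varu)*randn(N,1);            % white Gaussian noise U[n]
   x=zeros(N,1);
   x(1)=randn(1,1);                    % first sample with variance rX[0]=1
   for n=2:N
      x(n)=a*x(n-1)+u(n);              % AR recursion
   end
   xhat=a*x(1:N-1);                    % predictions of x(2),...,x(N)
   msehat(i)=mean((x(2:N)-xhat).^2);   % estimated prediction MSE
   msetheory(i)=varu;                  % prediction error is U[n]
end
```

Since the prediction error equals U[n], the estimated MSE should be close to 1 - a^2 in each case, which is much smaller for a = 0.98 than for a = 0.25 even though both processes have the same power.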

The general solution of (18.28) is fairly complicated. The details are given in Appendix 18A. We now summarize the solution and then present an example.

1. Assume that the z-transform of the ACS, which is

\[
\mathcal{P}_X(z) = \sum_{k=-\infty}^{\infty} r_X[k]z^{-k}
\]

can be written as

\[
\mathcal{P}_X(z) = \frac{\sigma_U^2}{A(z)A(z^{-1})} \tag{18.31}
\]

where

\[
A(z) = 1 - \sum_{k=1}^{\infty} a[k]z^{-k}.
\]

It is required that $A(z)$ have all its zeros inside the unit circle of the z-plane, i.e., the filter with z-transform $1/A(z)$ is a stable and causal filter [Jackson 1991].

2. The solution of (18.28) for the impulse response is

\[
h_{\rm opt}[k] = a[k+1] \qquad k = 0,1,\ldots
\]

and the minimum MSE is

\[
\mathrm{mse}_{\min} = E[(X[n_0+1]-\hat{X}[n_0+1])^2] = \sigma_U^2.
\]

3. The optimal linear predictor becomes from (18.21)

\[
\hat{X}[n_0+1] = \sum_{k=0}^{\infty} a[k+1]X[n_0-k] \tag{18.32}
\]

and has the minimum MSE, $\mathrm{mse}_{\min} = \sigma_U^2$.


Clearly, the most difficult part of the solution is putting $\mathcal{P}_X(z)$ into the required form of (18.31). In terms of the PSD the requirement is

\[
\begin{aligned}
P_X(f) = \mathcal{P}_X(\exp(j2\pi f)) &= \frac{\sigma_U^2}{A(\exp(j2\pi f))A(\exp(-j2\pi f))} \\
&= \frac{\sigma_U^2}{A(\exp(j2\pi f))A^*(\exp(j2\pi f))} \\
&= \frac{\sigma_U^2}{|A(\exp(j2\pi f))|^2} \\
&= \frac{\sigma_U^2}{\left|1-\sum_{k=1}^{\infty} a[k]\exp(-j2\pi fk)\right|^2}.
\end{aligned}
\]

But the form of the PSD is seen to be a generalization of the PSD for the AR random process. In fact, if we truncate the sum so that the required PSD becomes

\[
P_X(f) = \frac{\sigma_U^2}{\left|1-\sum_{k=1}^{p} a[k]\exp(-j2\pi fk)\right|^2}
\]

then we have the PSD of what is referred to as an AR random process of order $p$, which is also denoted by the symbolism AR($p$). In this case, the random process is defined as

\[
X[n] = \sum_{k=1}^{p} a[k]X[n-k] + U[n] \tag{18.33}
\]

where as usual $U[n]$ is white Gaussian noise with variance $\sigma_U^2$. Of course, for $p = 1$ we have our previous definition of the AR random process, which is an AR(1) random process with $a[1] = a$. Assuming an AR($p$) random process so that $a[l] = 0$ for $l > p$, the solution for the optimal one-step linear predictor is from (18.32)

\[
\hat{X}[n_0+1] = \sum_{l=0}^{p-1} a[l+1]X[n_0-l]
\]

and letting $k = l+1$ produces

\[
\hat{X}[n_0+1] = \sum_{k=1}^{p} a[k]X[n_0+1-k] \tag{18.34}
\]

and the minimum MSE is $\sigma_U^2$. Another example follows.
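To make (18.34) concrete, here is a minimal sketch for an AR(2) random process. The parameter values a[1] = 1.2, a[2] = -0.8, the sample count, and the variable names are assumptions for illustration only. The predictor is implemented with the standard filter function, and the estimated prediction MSE is compared with the excitation noise variance.

```matlab
randn('state',0)                 % seed, as in the chapter's listings
a=[1.2;-0.8]; varu=1; N=20000;   % assumed AR(2) parameters and length
u=sqrt(varu)*randn(N,1);         % white Gaussian noise U[n]
x=filter(1,[1, -a.'],u);         % X[n]=a[1]X[n-1]+a[2]X[n-2]+U[n]
xhat=filter([0, a.'],1,x);       % xhat(n)=a[1]x(n-1)+a[2]x(n-2), as in (18.34)
e=x(3:N)-xhat(3:N);              % one-step prediction errors
msehat=mean(e.^2);               % estimated prediction MSE
```

Because the prediction error reproduces U[n], msehat should be close to varu, in agreement with the stated minimum MSE.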

Example 18.6 - One-step linear prediction of MA random process

Consider the zero mean WSS random process given by $X[n] = U[n] - bU[n-1]$, where $|b| < 1$ and $U[n]$ is white Gaussian noise with variance $\sigma_U^2$ (also called an MA random process). This random process is a special case of that used in Example 18.1 for which $h[0] = 1$ and $h[1] = -b$ and $U[n]$ is white Gaussian noise. To find the optimal linear predictor we need to put the z-transform of the ACS into the required form. First we determine the PSD. Since the system function is easily shown to be $\mathcal{H}(z) = 1 - bz^{-1}$, the frequency response follows as $H(f) = 1 - b\exp(-j2\pi f)$. From (18.14) the PSD becomes

\[
P_X(f) = H(f)H^*(f)\sigma_U^2 = (1-b\exp(-j2\pi f))(1-b\exp(j2\pi f))\sigma_U^2
\]

and hence replacing $\exp(j2\pi f)$ by $z$, we have

\[
\mathcal{P}_X(z) = (1-bz^{-1})(1-bz)\sigma_U^2. \tag{18.35}
\]

By equating (18.35) to the required form for $\mathcal{P}_X(z)$ given in (18.31) we have

\[
A(z) = \frac{1}{1-bz^{-1}}.
\]

To convert this to $1 - \sum_{k=1}^{\infty} a[k]z^{-k}$, we take the inverse z-transform, assuming a stable and causal sequence, to yield

\[
\mathcal{Z}^{-1}\{A(z)\} = \begin{cases} b^k & k \ge 0 \\ 0 & k < 0 \end{cases}
\]

and so $a[k] = -b^k$ for $k \ge 1$. (Note why $|b| < 1$ is required or else $a[n]$ would not be stable.) The optimal predictor is from (18.32)

\[
\begin{aligned}
\hat{X}[n_0+1] &= \sum_{k=0}^{\infty} a[k+1]X[n_0-k] \\
&= \sum_{k=0}^{\infty} (-b^{k+1})X[n_0-k] \\
&= -bX[n_0] - b^2X[n_0-1] - b^3X[n_0-2] - \cdots
\end{aligned}
\]

and the minimum MSE is

\[
\mathrm{mse}_{\min} = \sigma_U^2.
\]

❖
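The predictor of Example 18.6 can be tried out numerically if the infinite sum is truncated. The sketch below keeps only the first K terms, which is an approximation introduced here, and the values b = 0.5, K = 20 and N = 20000 are assumptions for illustration. With |b| < 1 the neglected coefficients decay geometrically, so the estimated MSE should be close to the excitation noise variance.

```matlab
randn('state',0)                 % seed, as in the chapter's listings
b=0.5; varu=1; N=20000; K=20;    % assumed illustrative values
u=sqrt(varu)*randn(N,1);         % white Gaussian noise U[n]
x=filter([1 -b],1,u);            % X[n]=U[n]-bU[n-1]
h=-(b.^(1:K)).';                 % h[k]=a[k+1]=-b^(k+1), k=0,...,K-1
xpred=filter(h,1,x);             % xpred(n) is the prediction of x(n+1)
e=x(K+1:N)-xpred(K:N-1);         % prediction errors after start-up
msehat=mean(e.^2);               % estimated prediction MSE
varx=mean(x.^2);                 % power of X[n], for comparison
```

Here varx is the MSE that would result from always predicting zero, so the gap between varx and msehat indicates how much the predictor gains.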

As a special case of practical interest, we next consider a finite length one-step linear predictor. By finite length we mean that the prediction can only depend on the present sample and past $M-1$ samples. In a derivation similar to the infinite length predictor it is easy to show (see the discussion in Section 14.8 and also Problem 18.20) that if the predictor is given by

\[
\hat{X}[n_0+1] = \sum_{k=0}^{M-1} h[k]X[n_0-k]
\]

which is just (18.21) with $h[k] = 0$ for $k \ge M$, then the optimal impulse response satisfies the $M$ simultaneous linear equations

\[
r_X[l+1] = \sum_{k=0}^{M-1} h[k]r_X[l-k] \qquad l = 0,1,\ldots,M-1.
\]

(If $M \to \infty$, these equations are identical to (18.28).) The equations can be written in vector/matrix form as

\[
\begin{bmatrix}
r_X[0] & r_X[1] & \cdots & r_X[M-1] \\
r_X[1] & r_X[0] & \cdots & r_X[M-2] \\
\vdots & \vdots & \ddots & \vdots \\
r_X[M-1] & r_X[M-2] & \cdots & r_X[0]
\end{bmatrix}
\begin{bmatrix} h[0] \\ h[1] \\ \vdots \\ h[M-1] \end{bmatrix}
=
\begin{bmatrix} r_X[1] \\ r_X[2] \\ \vdots \\ r_X[M] \end{bmatrix}. \tag{18.36}
\]

The corresponding minimum MSE is given by


\[
\mathrm{mse}_{\min} = r_X[0] - \sum_{k=0}^{M-1} h_{\rm opt}[k]r_X[k+1]. \tag{18.37}
\]

These equations are called the Wiener-Hopf equations. In general, they must be solved numerically but there are many efficient algorithms to do so [Kay 1988]. The algorithms take advantage of the structure of the matrix, which is seen to be an autocorrelation matrix as first described in Section 17.4. As such, it is symmetric, positive definite, and has the Toeplitz property. The Toeplitz property asserts that the elements along each northwest-southeast diagonal are identical. Another important connection between the linear prediction equations and an AR($p$) random process is made by letting $M = p$ in (18.36). Then, since for an AR($p$) process, we have that $h[n] = a[n+1]$ for $n = 0,1,\ldots,p-1$ (recall from (18.34) that $\hat{X}[n_0+1] = \sum_{k=1}^{p} a[k]X[n_0+1-k]$) the Wiener-Hopf equations become
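As an illustration of solving the Wiener-Hopf equations (18.36) numerically, the sketch below uses the MA random process X[n] = U[n] - bU[n-1] of Example 18.6, whose ACS is r_X[0] = (1 + b^2) times the noise variance, r_X[1] = -b times the noise variance, and zero otherwise. The values b = 0.5 and M = 5 are assumptions, and a general-purpose linear solve is used rather than one of the efficient Toeplitz algorithms mentioned in the text.

```matlab
b=0.5; varu=1; M=5;              % assumed illustrative values
rX=zeros(M+1,1);                 % ACS lags 0,...,M of X[n]=U[n]-bU[n-1]
rX(1)=(1+b^2)*varu;              % rX[0]
rX(2)=-b*varu;                   % rX[1]; all higher lags are zero
R=toeplitz(rX(1:M));             % M x M autocorrelation matrix of (18.36)
r=rX(2:M+1);                     % right-hand-side vector of (18.36)
h=R\r;                           % h[0],...,h[M-1]
msemin=rX(1)-h'*r;               % minimum MSE from (18.37)
hinf=-(b.^(1:M)).';              % infinite length solution -b^(k+1)
```

For this process the finite length MSE is slightly larger than the infinite length value varu, and the coefficients h are close to, but not identical with, the first M coefficients of the infinite length predictor.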


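\[
\begin{bmatrix}
r_X[0] & r_X[1] & \cdots & r_X[p-1] \\
r_X[1] & r_X[0] & \cdots & r_X[p-2] \\
\vdots & \vdots & \ddots & \vdots \\
r_X[p-1] & r_X[p-2] & \cdots & r_X[0]
\end{bmatrix}
\begin{bmatrix} a[1] \\ a[2] \\ \vdots \\ a[p] \end{bmatrix}
=
\begin{bmatrix} r_X[1] \\ r_X[2] \\ \vdots \\ r_X[p] \end{bmatrix}. \tag{18.38}
\]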

It is important to note that for an AR($p$) random process, the optimal one-step linear predictor based on the infinite number of samples $\{X[n_0], X[n_0-1], \ldots\}$ is the same as that based on only the finite number of samples $\{X[n_0], X[n_0-1], \ldots, X[n_0-(p-1)]\}$ [Kay 1988]. The equations of (18.38) are now referred to as the Yule-Walker equations. In this form they relate the ACS samples $\{r_X[0], r_X[1], \ldots, r_X[p]\}$ to the AR filter parameters $\{a[1], a[2], \ldots, a[p]\}$. If the ACS samples are known, then the AR filter parameters can be obtained by solving the equations. Furthermore, once the filter parameters have been found from (18.38), the variance of the white noise random process $U[n]$ is found from

\[
\sigma_U^2 = \mathrm{mse}_{\min} = r_X[0] - \sum_{k=1}^{p} a[k]r_X[k] \tag{18.39}
\]

which follows by letting $h_{\rm opt}[k] = a[k+1]$ with $M = p$ in (18.37). In the real-world example of Section 18.7 we will see how these equations can provide a method to synthesize speech.
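A small sketch may help to see the Yule-Walker equations at work. It uses the AR(2) ACS quoted in Problem 18.19 of this chapter, r_X[k] = (sigma_U^2/(1 - r^4)) r^|k| cos(k pi/2), for which a[1] = 0 and a[2] = -r^2. The value r = 0.5 and the variable names are assumptions; the code solves (18.38) and then applies (18.39).

```matlab
r0=0.5; varu_true=1-r0^4; p=2;            % assumed illustrative values
k=(0:p)';                                 % lags 0,...,p
rX=(varu_true/(1-r0^4))*r0.^abs(k).*cos(k*pi/2);  % ACS samples
R=toeplitz(rX(1:p));                      % p x p autocorrelation matrix
r=rX(2:p+1);                              % right-hand side of (18.38)
a=R\r;                                    % AR filter parameters a[1],...,a[p]
varu=rX(1)-a'*r;                          % white noise variance from (18.39)
```

The solution should return a(1) = 0 and a(2) = -r0^2 up to rounding, and varu should equal the value varu_true that was used to build the ACS.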


18.6  Continuous-Time  Definitions  and  Formulas 


For a continuous-time WSS random process as defined in Section 17.8 the linear system of interest is a linear time invariant (LTI) system. It is characterized by its impulse response $h(\tau)$. If a random process $U(t)$ is input to an LTI system with impulse response $h(\tau)$, the output random process $X(t)$ is

\[
X(t) = \int_{-\infty}^{\infty} h(\tau)U(t-\tau)\,d\tau.
\]

The integral is referred to as a convolution integral and in shorthand notation the output is given by $X(t) = h(t) \star U(t)$. If $U(t)$ is WSS with constant mean $\mu_U$ and ACF $r_U(\tau)$, then the output random process $X(t)$ is also WSS. It has a mean function

\[
\mu_X = \left(\int_{-\infty}^{\infty} h(\tau)\,d\tau\right)\mu_U = H(0)\mu_U \tag{18.40}
\]

where

\[
H(F) = \int_{-\infty}^{\infty} h(\tau)\exp(-j2\pi F\tau)\,d\tau
\]

is the frequency response of the LTI system. The ACF of the output random process $X(t)$ is

\[
r_X(\tau) = h(-\tau) \star h(\tau) \star r_U(\tau) \tag{18.41}
\]

and therefore the PSD becomes

\[
P_X(F) = |H(F)|^2P_U(F). \tag{18.42}
\]


An  example  follows. 

Example 18.7 - Interference rejection filter

A signal, which is modeled as a WSS random process $S(t)$, is corrupted by an additive interference $I(t)$, which can be modeled as a randomly phased sinusoid with a frequency of $F_0 = 60$ Hz. The corrupted signal is $X(t) = S(t) + I(t)$. It is desired to filter out the interference but, if possible, to avoid altering the PSD of the signal due to the filtering. Since the sinusoidal interference has a period of $T = 1/F_0 = 1/60$ seconds, it is proposed to filter $X(t)$ with the differencing filter

\[
Y(t) = X(t) - X(t-T). \tag{18.43}
\]

The motivation for choosing this type of filter is that a periodic signal with period $T$ will have the same value at any two time instants separated by $T$ seconds. Hence, the difference should be zero for all $t$. We wish to determine the PSD at the filter output. We will assume that the interference is uncorrelated with the signal. This assumption means that the ACF of $X(t)$ is the sum of the ACFs of $S(t)$ and $I(t)$ and consequently the PSDs sum as well (see Problem 18.33). The differencing filter is an LTI system and so its output can be written as

\[
Y(t) = \int_{-\infty}^{\infty} h(\tau)X(t-\tau)\,d\tau \tag{18.44}
\]

for the appropriate choice of the impulse response. The impulse response is obtained by equating (18.44) to (18.43) from which it follows that

\[
h(\tau) = \delta(\tau) - \delta(\tau-T) \tag{18.45}
\]

as can easily be verified. By taking the Fourier transform, the frequency response becomes

\[
\begin{aligned}
H(F) &= \int_{-\infty}^{\infty} (\delta(\tau)-\delta(\tau-T))\exp(-j2\pi F\tau)\,d\tau \\
&= 1 - \exp(-j2\pi FT).
\end{aligned} \tag{18.46}
\]

To determine the PSD at the filter output we use (18.42) and note that for the randomly phased sinusoid with amplitude $A$ and frequency $F_0$, the ACF is (see Problem 17.46)

\[
r_I(\tau) = \frac{A^2}{2}\cos(2\pi F_0\tau)
\]

and therefore its PSD, which is the Fourier transform, is given by

\[
P_I(F) = \frac{A^2}{4}\delta(F+F_0) + \frac{A^2}{4}\delta(F-F_0).
\]

The PSD at the filter input is $P_X(F) = P_S(F) + P_I(F)$ (the PSDs add due to the uncorrelated assumption) and therefore the PSD at the filter output is

\[
\begin{aligned}
P_Y(F) &= |H(F)|^2P_X(F) = |H(F)|^2(P_S(F)+P_I(F)) \\
&= |1-\exp(-j2\pi FT)|^2(P_S(F)+P_I(F)).
\end{aligned}
\]

The magnitude-squared of the frequency response of (18.46) can also be written in real form as

\[
|H(F)|^2 = 2 - 2\cos(2\pi FT)
\]

and is shown in Figure 18.8. Note that it exhibits zeros at multiples of $F = 1/T = F_0$. Hence, $|H(F_0)|^2 = 0$ and so the interfering sinusoid is filtered out. The PSD at the filter output then becomes

\[
\begin{aligned}
P_Y(F) &= |H(F)|^2P_S(F) \\
&= 2(1-\cos(2\pi FT))P_S(F).
\end{aligned}
\]

Figure 18.8: Magnitude-squared frequency response of interference canceling filter with $F_0 = 1/T$.

Unfortunately, the signal PSD has also been modified. What do you think would happen if the signal were periodic with period $1/(2F_0)$?

❖ 
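The behavior of the differencing filter of Example 18.7 can be checked on sampled data. In the sketch below the sampling rate Fs, the amplitude, and the fixed phase value are assumptions made for illustration; Fs is chosen so that the delay T corresponds to a whole number of samples. The code applies (18.43) to the sinusoid alone and also evaluates the magnitude-squared frequency response from (18.46).

```matlab
F0=60; T=1/F0; Fs=6000;         % Fs is an assumed sampling rate (T*Fs integer)
D=round(T*Fs);                  % delay of T seconds expressed in samples
t=(0:Fs-1)'/Fs;                 % one second of time samples
A=1; theta=2*pi*0.3;            % assumed amplitude and one fixed phase value
I=A*cos(2*pi*F0*t+theta);       % interference I(t)
y=I(D+1:end)-I(1:end-D);        % Y(t)=I(t)-I(t-T) as in (18.43)
F=(0:0.5:180)';                 % frequencies in Hz
H2=abs(1-exp(-1j*2*pi*F*T)).^2; % |H(F)|^2 from (18.46)
```

The output y is zero to within rounding error, and H2 agrees with 2 - 2cos(2 pi F T), vanishing at F = 60 Hz and its multiples.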

18.7  Real-World  Example  —  Speech  Synthesis 

It is commonplace to hear computer generated speech when asking for directory assistance in obtaining telephone numbers, in using text to speech conversion programs in computers, and in playing with a multitude of children’s toys. One of the earliest applications of computer speech synthesis was the Texas Instruments Speak and Spell (a registered trademark of Texas Instruments). The approach to producing intelligible, if not exactly human sounding, speech is to mimic the human speech production process. A speech production model is shown in Figure 18.9 [Rabiner and Schafer 1978]. It is well known that speech sounds can be delineated into two classes: voiced speech such as a vowel sound and unvoiced speech such as a consonant sound. A voiced sound such as “ahhh” (the o in “lot” for example) is produced by the vibration of the vocal cords, while an unvoiced sound such as “sss” (the s in “runs” for example) is produced by passing air over a constriction in the mouth. In either case, the sound is the output of the vocal tract with the difference being the excitation sound and the subsequent filtering of that sound. For voiced sounds the excitation is modeled as a train of impulses to produce a periodic sound while for an unvoiced sound it is modeled as white noise to produce a noise-like sound (see Figure 18.9). The excitation is modified by the vocal tract, which can be modeled by an LSI filter. Knowing the excitation waveform and the vocal tract system function allows us to synthesize speech. For the unvoiced sound we pass discrete white Gaussian noise through an LSI filter with system function $\mathcal{H}_{\rm uv}(z)$. We next concentrate on the synthesis of unvoiced sounds with the synthesis of voiced sounds being similar.

Figure 18.9: Speech production model.

It has been found that a good model for the vocal tract is the LSI filter with system function

\[
\mathcal{H}_{\rm uv}(z) = \frac{1}{1-\sum_{k=1}^{p} a[k]z^{-k}}
\]

which is an all-pole filter. Typically, the order of the filter $p$, which is the number of poles, is chosen to be $p = 12$. The output of the filter $X[n]$ for a white Gaussian noise random process input $U[n]$ with variance $\sigma_U^2$ is given as the WSS random process

\[
X[n] = \sum_{k=1}^{p} a[k]X[n-k] + U[n]
\]

which is recognized as the defining difference equation for an AR($p$) random process. Hence, unvoiced speech sounds can be synthesized using this difference equation for an appropriate choice of the parameters $\{a[1], a[2], \ldots, a[p], \sigma_U^2\}$. The parameters will be different for each unvoiced sound to be synthesized. To determine the parameters for a given sound, a segment of the target speech sound is used to estimate the ACS. Estimation of the ACS was previously described in Section 17.7. Then, the parameters $a[k]$ for $k = 1,2,\ldots,p$ can be obtained by solving the Yule-Walker equations (same as Wiener-Hopf equations). The theoretical ACS samples required are replaced by estimated ones to yield the set of simultaneous linear equations from (18.38) as


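\[
\begin{bmatrix}
\hat{r}_X[0] & \hat{r}_X[1] & \cdots & \hat{r}_X[p-1] \\
\hat{r}_X[1] & \hat{r}_X[0] & \cdots & \hat{r}_X[p-2] \\
\vdots & \vdots & \ddots & \vdots \\
\hat{r}_X[p-1] & \hat{r}_X[p-2] & \cdots & \hat{r}_X[0]
\end{bmatrix}
\begin{bmatrix} \hat{a}[1] \\ \hat{a}[2] \\ \vdots \\ \hat{a}[p] \end{bmatrix}
=
\begin{bmatrix} \hat{r}_X[1] \\ \hat{r}_X[2] \\ \vdots \\ \hat{r}_X[p] \end{bmatrix} \tag{18.47}
\]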

which are solved to yield the $\hat{a}[k]$’s. Then, the white noise variance estimate is found from (18.39) as

\[
\hat{\sigma}_U^2 = \hat{r}_X[0] - \sum_{k=1}^{p} \hat{a}[k]\hat{r}_X[k] \tag{18.48}
\]

where $\hat{a}[k]$ is given by the solution of the Yule-Walker equations of (18.47). Hence, we estimate the ACS for lags $k = 0,1,\ldots,p$ based on an actual speech sound and then solve the equations of (18.47) to obtain $\{\hat{a}[1], \hat{a}[2], \ldots, \hat{a}[p]\}$ and finally, determine $\hat{\sigma}_U^2$ using (18.48). The only modification that is commonly made is to the ACS estimate, which is chosen to be

\[
\hat{r}_X[k] = \frac{1}{N}\sum_{n=0}^{N-1-k} x[n]x[n+k] \qquad k = 0,1,\ldots,p \tag{18.49}
\]

and which differs from the one given in Section 17.7 in that the normalizing factor is $N$ instead of $N-k$. For $N \gg p$ this will have minimal effect on the parameter estimates but has the benefit of ensuring a stable filter estimate, i.e., the poles of $\hat{\mathcal{H}}_{\rm uv}(z)$ will lie inside the unit circle [Kay 1988]. This method of estimating the AR parameters is called the autocorrelation method of linear prediction. The entire procedure of modeling speech by an AR($p$) model is referred to as linear predictive coding (LPC). The name originated with the connection of (18.47) as a set of linear prediction equations, although the ultimate goal here is not linear prediction but speech modeling [Makhoul 1975].
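The steps of the autocorrelation method can be sketched on synthetic data. Since no speech samples are available here, an AR(2) realization with assumed parameters stands in for the speech segment; the block length of 160 follows the text, while the order p = 2 and the true parameter values are assumptions for illustration. The code applies (18.49), (18.47) and (18.48) and then examines the estimated pole locations.

```matlab
randn('state',0)                   % seed, as in the chapter's listings
N=160; p=2;                        % block length from the text; assumed order
atrue=[1.2;-0.8];                  % assumed AR(2) parameters (stable filter)
x=filter(1,[1, -atrue.'],randn(N,1)); % synthetic stand-in for a speech block
rX=zeros(p+1,1);
for k=0:p
   rX(k+1)=(1/N)*sum(x(1:N-k).*x(1+k:N)); % ACS estimate of (18.49)
end
R=toeplitz(rX(1:p));               % estimated autocorrelation matrix
r=rX(2:p+1);                       % right-hand side of (18.47)
a=R\r;                             % estimated AR filter parameters
varu=rX(1)-a'*r;                   % noise variance estimate of (18.48)
poles=roots([1;-a]);               % poles of the estimated all-pole filter
```

With the normalizing factor N in the ACS estimate, the magnitudes of poles come out less than one, consistent with the stability property cited from [Kay 1988].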

To demonstrate the modeling of an unvoiced sound consider the spoken word “seven” shown in Figure 18.10. A portion of the “sss” utterance is shown in Figure 18.11 and as expected is noise-like. It is composed of the samples indicated between the dashed vertical lines in Figure 18.10. Typically, in analyzing speech sounds to estimate their AR parameters, we sample at 8 kHz and use a block of data 20 msec (about 160 samples) in length. The samples of $x(t)$ in Figure 18.10 from $t = 115$ msec to $t = 135$ msec are shown in Figure 18.11. With a model order of $p = 12$ we use (18.49) to estimate the ACS lags and then solve the Yule-Walker equations of (18.47) and also use (18.48) to yield the estimated parameters $\{\hat{a}[1], \hat{a}[2], \ldots, \hat{a}[p], \hat{\sigma}_U^2\}$. If the model is reasonably accurate, then the synthesized sound should be perceived as being similar to the original sound. It has been found through experimentation that if the PSDs are similar, then this will be the case. Hence, the estimated PSD

\[
\hat{P}_X(f) = \frac{\hat{\sigma}_U^2}{\left|1-\sum_{k=1}^{p} \hat{a}[k]\exp(-j2\pi fk)\right|^2} \tag{18.50}
\]

should be a good match to the normalized and squared-magnitude of the Fourier transform of the speech sound. The latter is of course the periodogram. We need only consider the match in power since it is well known that the ear is relatively insensitive to the phase of the speech waveform [Rabiner and Schafer 1978].

Figure 18.10: Waveform for the utterance “seven” [Allu 2005]. (Horizontal axis: $t$ in seconds; vertical axis: $x(t)$.)

Figure 18.11: A 20 msec segment of the waveform for “sss”. See Figure 18.10 for segment extracted as indicated by the vertical dashed lines. (Horizontal axis: $n$.)

As an example, for the portion of the “sss” sound shown in Figure 18.11 a periodogram and the AR PSD model of (18.50) are compared in Figure 18.12. Both PSDs are plotted in dB, obtained by taking $10\log_{10}$ of the PSD. Note that the resonances, i.e., the portions of the PSD that are large and which are most important for intelligibility, are well matched by the model. This supports the validity of the AR model. Finally, to synthesize the “sss” sound we compute

\[
x[n] = \sum_{k=1}^{p} \hat{a}[k]x[n-k] + u[n]
\]

where $u[n]$ is a pseudorandom Gaussian noise sequence [Knuth 1981] with variance $\hat{\sigma}_U^2$, for a total of about 20 msec. Then, the samples are converted to an analog sound using a digital-to-analog (D/A) convertor (see Figure 18.9). The TI Speak and Spell used $p = 10$ and stored the AR parameters in memory for each sound. The MATLAB code used to generate Figure 18.12 is given below.

N=length(xseg); % xseg is the data shown in Figure 18.11
Nfft=1024; % set up FFT length for Fourier transforms
freq=[0:Nfft-1]'/Nfft-0.5; % PSD frequency points to be plotted
P_per=(1/N)*abs(fftshift(fft(xseg,Nfft))).^2; % compute periodogram
p=12; % dimension of autocorrelation matrix
for k=1:p+1 % estimate ACS for k=0,1,...,p (MATLAB indexes
            % must start at 1)
   rX(k,1)=(1/N)*sum(xseg(1:N-k+1).*xseg(k:N));
end
r=rX(2:p+1); % fill in right-hand-side vector
for i=1:p % fill in autocorrelation matrix
   for j=1:p
      R(i,j)=rX(abs(i-j)+1);
   end
end
a=inv(R)*r; % solve linear equations to find AR filter parameters
varu=rX(1)-a'*r; % find excitation noise variance
den=abs(fftshift(fft([1;-a],Nfft))).^2; % compute denominator of AR PSD
P_AR=varu./den; % compute AR PSD

Figure 18.12: Periodogram, shown as the light line, and AR PSD model, shown as the darker line, for speech segment of Figure 18.11.
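The synthesis step described in this section, passing pseudorandom Gaussian noise through the all-pole filter, can be sketched as follows. The parameter values below are assumed stand-ins for estimated AR parameters (an order of 2 is used only to keep the sketch short), and the recursion starts from zero initial conditions, which is a simplification.

```matlab
randn('state',0)                  % seed, as in the chapter's listings
a=[1.2;-0.8]; varu=0.01;          % assumed stand-ins for estimated parameters
Ns=160;                           % about 20 msec at 8 kHz, as in the text
u=sqrt(varu)*randn(Ns,1);         % pseudorandom Gaussian excitation u[n]
x=filter(1,[1, -a.'],u);          % x[n]=a[1]x[n-1]+a[2]x[n-2]+u[n]
```

With parameters estimated from real speech, the vector x would be the block of synthesized samples sent to the D/A convertor.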

See also Problem 18.34 for an application of AR modeling to spectral estimation [Kay 1988].

Problems

18.1 (^) (f) An LSI system with system function $\mathcal{H}(z) = 1 - z^{-1} - z^{-2}$ is used to filter a discrete-time white noise random process with variance $\sigma_U^2 = 1$. Determine the ACS and PSD of the output random process.

18.2 (f) A discrete-time WSS random process with mean $\mu_U = 2$ is input to an LSI system with impulse response $h[n] = (1/2)^n$ for $n \ge 0$ and $h[n] = 0$ for $n < 0$. Find the mean sequence at the system output.

18.3 (w) A discrete-time white noise random process $U[n]$ is input to a system to produce the output random process $X[n] = a^{|n|}U[n]$ for $|a| < 1$. Determine the output PSD.

18.4 (^) (w) A randomly phased sinusoid $X[n] = \cos(2\pi(0.25)n + \Theta)$ with $\Theta \sim \mathcal{U}(0, 2\pi)$ is input to an LSI system with system function $\mathcal{H}(z) = 1 - b_1z^{-1} - b_2z^{-2}$. Determine the filter coefficients $b_1$, $b_2$ so that the sinusoid will have zero power at the filter output.

18.5 (f,c) A discrete-time WSS random process $X[n]$ is defined by the difference equation $X[n] = aX[n-1] + U[n] - bU[n-1]$, where $U[n]$ is a discrete-time white noise random process with variance $\sigma_U^2 = 1$. Plot the PSD of $X[n]$ if $a = 0.9$, $b = 0.2$ and also if $a = 0.2$, $b = 0.9$ and explain your results.

18.6 (f) A discrete-time WSS random process $X[n]$ is defined by the difference equation $X[n] = 0.5X[n-1] + U[n] - 0.5U[n-1]$, where $U[n]$ is a discrete-time white noise random process with variance $\sigma_U^2 = 1$. Find the ACS and PSD of $X[n]$ and explain your results.

18.7 (^) (f) A differencer is given by $X[n] = U[n] - U[n-1]$. If the input random process $U[n]$ has the PSD $P_U(f) = 1 - \cos(2\pi f)$, determine the ACS and PSD at the output of the differencer.

18.8 (t) Verify that the discrete-time Fourier transform of $r_X[k]$ given in (18.15) is $\sigma_U^2|H(f)|^2$.

18.9  (w)  A  discrete-time  white  noise  random  process  is  input  to  an  LSI  system 
which  has  h[0]  =  1  with  all  the  other  impulse  response  samples  nonzero.  Can 
the  output  power  of  the  filter  ever  be  less  than  the  input  power? 

18.10 (w) A random process with PSD

\[
P_X(f) = \frac{1}{\left|1-\frac{1}{2}\exp(-j2\pi f)\right|^2}
\]

is to be filtered with an LSI system to produce a white noise random process $U[n]$ with variance $\sigma_U^2 = 4$ at the output. What should the difference equation of the LSI system be?

18.11 (w,c) An AR random process of order 2 is given by the recursive difference equation $X[n] = 2r\cos(2\pi f_0)X[n-1] - r^2X[n-2] + U[n]$, where $U[n]$ is white Gaussian noise with variance $\sigma_U^2 = 1$. For $r = 0.7$, $f_0 = 0.1$ and also for $r = 0.95$, $f_0 = 0.1$ plot the PSD of $X[n]$. Can you explain your results? Hint: Determine the pole locations of $\mathcal{H}(z)$.

18.12 (w) A signal, which is bandlimited to $B$ cycles/sample with $B < 1/2$, is modeled as a WSS random process with zero mean and PSD $P_S(f)$. If white noise is added to the signal with $\sigma_W^2 = 1$, find the frequency response of the optimal Wiener smoother. Explain your results.

18.13 (^) (f,c) A zero mean signal with PSD $P_S(f) = 2 - 2\cos(2\pi f)$ is embedded in white noise with variance $\sigma_W^2 = 1$. Plot the frequency response of the optimal Wiener smoother. Also, compute the minimum MSE. Hint: For the MSE use a “sum” approximation to the integral (see Problem 1.14).

18.14 (c) In this problem we simulate the Wiener smoother. First generate $N = 50$ samples of a signal $S[n]$, which is an AR random process (assume that $U[n]$ is white Gaussian noise) with $a = 0.25$ and $\sigma_U^2 = 0.5$. Remember to set the initial condition $S[-1] \sim \mathcal{N}(0, \sigma_U^2/(1-a^2))$. Next add white Gaussian noise $W[n]$ with $\sigma_W^2 = 1$ to the AR random process realization. Finally, use the MATLAB code in the chapter to smooth the noise-corrupted signal. Plot the true signal and the smoothed signal. How well does the smoother perform?

18.15 (w) To see that the linear prediction equations of (18.28) cannot be solved directly using z-transforms, take the z-transform of both sides of the equation. Next solve for $\mathcal{H}(z) = \mathcal{Z}\{h[k]\}$. Explain why the solution for the predictor cannot be correct.

18.16 (t) In this problem we rederive the optimal one-step linear predictor for the
AR random process of Example 18.5. Assume that $X[n_0 + 1]$ is to be predicted
based on observing the realization of $\{X[n_0], X[n_0 - 1], \ldots\}$. The random
process $X[n]$ is assumed to be an AR random process described in Example
18.5. Prove that $\hat{X}[n_0 + 1] = aX[n_0]$ satisfies the orthogonality principle,
making use of the result that $E[U[n_0 + 1]X[n_0 - k]] = 0$ for $k = 0, 1, \ldots$. The
latter result says that “future” samples of $U[n]$ must be uncorrelated with the
present and past samples of $X[n]$. Explain why this is true. Hint: Recall that
for an AR random process $X[n]$ can be rewritten as $X[n] = \sum_{l=0}^{\infty} a^l U[n - l]$.

18.17 (w) For the AR random process described in Example 18.5 show that the
minimum MSE for the optimal predictor $\hat{X}[n_0 + 1] = aX[n_0]$ is given by
$\mathrm{mse}_{\min} = r_X[0](1 - a^2)$. Use this to explain why the results shown in Figure
18.7 are reasonable.

18.18 (^) (w) Express the minimum MSE given in the previous problem in terms
of $r_X[0]$ and the correlation coefficient between $X[n_0]$ and $X[n_0 + 1]$. What
happens to the minimum MSE if the correlation coefficient magnitude
approaches one and also if it is zero?

18.19 (c) Consider an AR(2) random process given by $X[n] = -r^2 X[n - 2] + U[n]$,
where $U[n]$ is white Gaussian noise with variance $\sigma_U^2$ and $0 < r < 1$. This
random process follows from (18.33) with $p = 2$ and $a[1] = 0$, $a[2] = -r^2$.
The ACS for this random process can be shown to be
$r_X[k] = (\sigma_U^2/(1 - r^4))\, r^{|k|} \cos(k\pi/2)$ [Kay 1988]. Find the optimal one-step linear predictor based
on the present and past samples of $X[n]$. Next perform a computer simulation
to see how the predictor performs. Consider the two cases $r = 0.5$, $\sigma_U^2 = 1 - r^4$
and $r = 0.95$, $\sigma_U^2 = 1 - r^4$ so that the average power in each case is the
same ($r_X[0] = 1$). Generate 150 samples of each process and discard the first
100 samples to make sure the generated samples are WSS. Then, plot the
realization and its predicted values for each case. Which value of $r$ results in
a more predictable process?

18.20 (t) Derive the Wiener-Hopf equations given by (18.36) and the resulting
minimum MSE given by (18.37) for the finite length predictor.

18.21 (f) For $M = 1$ solve the Wiener-Hopf equations given by (18.36) to find $h[0]$.
Relate this to $\mathrm{cov}(X, Y)/\mathrm{var}(X)$ used in the prediction of $Y$ given $X = x$.

18.22 (^) (f) The MA random process described in Example 18.6 and given by
$X[n] = U[n] - bU[n - 1]$ has as its ACS for $\sigma_U^2 = 1$

$$r_X[k] = \begin{cases} 1 + b^2 & k = 0 \\ -b & k = 1 \\ 0 & k \geq 2. \end{cases}$$

For $M = 2$ solve the Wiener-Hopf equations to find this finite length predictor
and then determine the minimum MSE. Compare this minimum MSE to that
of the infinite length predictor given in Example 18.6.

18.23 (f) It is desired to predict white noise. Solve the Wiener-Hopf equations for
$r_X[k] = \sigma_X^2 \delta[k]$ and explain your results.

18.24  (^)  (f,c)  For  the  MA  random  process  X[n]  =  U[n ]  —  \U[n  —  1]  where  U[n] 
is white Gaussian noise with $\sigma_U^2 = 1$ find the optimal finite length predictor
$\hat{X}[n_0 + 1] = h[0]X[n_0] + h[1]X[n_0 - 1]$ and the corresponding minimum MSE.
Next simulate the random process and compare the estimated minimum MSE
with the theoretical one. Hint: Use your results from Problem 18.22.

18.25 (f) Consider the prediction of a randomly phased sinusoid whose ACS is
$r_X[k] = \cos(2\pi f_0 k)$. For $M = 2$ solve the Wiener-Hopf equations to determine
the optimal linear predictor and also the minimum MSE. Hint: You should be
able to show that the minimum MSE is zero. Use the trigonometric identity
$\cos(2\theta) = 2\cos^2(\theta) - 1$.

18.26 (t) In this problem we consider the L-step infinite length predictor of an AR
random process. Let the predictor be given as

$$\hat{X}[n_0 + L] = \sum_{k=0}^{\infty} h[k]\,X[n_0 - k]$$

and show that the linear equations to be solved to determine the optimal $h[k]$'s
are

$$r_X[l + L] = \sum_{k=0}^{\infty} h[k]\,r_X[l - k] \qquad l = 0, 1, \ldots.$$

Next show that the minimum MSE is

$$\mathrm{mse}_{\min} = r_X[0] - \sum_{k=0}^{\infty} h_{\mathrm{opt}}[k]\,r_X[k + L].$$

Finally, for an AR random process with ACS $r_X[k] = (\sigma_U^2/(1 - a^2))\,a^{|k|}$ show
that

$$\hat{X}[n_0 + L] = a^L X[n_0]$$
$$\mathrm{mse}_{\min} = r_X[0]\,(1 - a^{2L})$$

for a predictor based on $\{X[n_0], X[n_0 - 1], \ldots\}$. To do so assume that $h[k] = 0$
for $k \geq 1$ and show that the equations can be satisfied by choosing $h[0]$.
Explain what happens to the quality of the prediction as $L$ increases and why.

18.27 (s^/) (t) In this problem we consider the interpolation of a random process
using a sample on either side of the sample to be interpolated. We wish to
estimate or interpolate $X[n_0]$ using $\hat{X}[n_0] = h[-1]X[n_0 + 1] + h[1]X[n_0 - 1]$ for
some impulse response values $h[-1], h[1]$. Find the optimal impulse response
values by minimizing the MSE of the interpolated sample if $X[n]$ is the AR
random process given by $X[n] = aX[n - 1] + U[n]$. Does your interpolator
average the samples on either side of $X[n_0]$? What happens as $a \to 1$ and as
$a \to 0$?

18.28 (f) An LTI system has the impulse response $h(\tau) = \exp(-\tau)$ for $\tau \geq 0$ and is
zero for $\tau < 0$. If continuous-time white noise with ACF $r_U(\tau) = (N_0/2)\delta(\tau)$
is input to the system, what is the PSD of the output random process? Sketch
the PSD.

18.29 ) (f) An LTI system has the impulse response $h(\tau) = 1$ for $0 \leq \tau \leq T$
and is zero otherwise. If continuous-time white noise with ACF $r_U(\tau) =
(N_0/2)\delta(\tau)$ is input to the system, what is the PSD of the output random
process? Sketch the PSD.

18.30 (f) A filter with frequency response $H(F) = \exp(-j2\pi F T_0)$ is used to filter a
WSS random process with PSD $P_X(F)$. What is the PSD at the filter output
and why?

18.31 (t) Prove that if a continuous-time white noise random process with ACF
$r_U(\tau) = (N_0/2)\delta(\tau)$ is input to an LTI system with impulse response $h(\tau)$,
then the ACF of the output random process is

$$r_X(\tau) = \frac{N_0}{2} \int_{-\infty}^{\infty} h(t)\,h(t + \tau)\,dt.$$

18.32 (^) (w) An RC electrical circuit with frequency response

$$H(F) = \frac{1/RC}{1/RC + j2\pi F}$$

is used to filter a white noise random process with ACF $r_U(\tau) = (N_0/2)\delta(\tau)$.
Find the total average power at the filter output. Is it infinite? Hint: See
previous problem.

18.33 (t) Two continuous-time WSS zero mean random processes $X(t)$ and $Y(t)$
are uncorrelated, which means that $E[X(t_1)Y(t_2)] = 0$ for all $t_1$ and $t_2$. Is the
sum random process $Z(t) = X(t) + Y(t)$ also WSS, and if so, what is its ACF
and PSD?

18.34 (c) In this problem we compare the periodogram spectral estimator to one
based on an AR(2) model. This assumes, however, that the AR model is
an accurate one for the random process. First generate $N = 50$ samples of
a realization of the AR(2) random process described in Problem 18.19 with
$r = 0.5$ and $\sigma_U^2 = 1 - r^4$. Next plot the periodogram of the realization (see
Section 17.6). Using the estimate of the ACS given in (18.49) solve the
Yule-Walker equations of (18.47) for $p = 2$ and then find $\sigma_U^2$ from (18.48). Finally,
plot the estimated PSD given by (18.50) and compare it to the periodogram
as well as the true PSD. You may also wish to print out $a[1]$ and $a[2]$ and
compare them to the theoretical values of $a[1] = 0$ and $a[2] = -r^2 = -0.25$.
Hint: You can use the MATLAB code given in Section 18.7.


Appendix  18A 


Solution for Infinite Length Predictor


The equations to be solved for the one-step predictor are from (18.28)

$$r_X[l + 1] = \sum_{k=0}^{\infty} h[k]\,r_X[l - k] \qquad l = 0, 1, \ldots \tag{18A.1}$$

and the minimum MSE can be written from (18.29) as

$$\mathrm{mse}_{\min} = r_X[0] - \sum_{k=0}^{\infty} h_{\mathrm{opt}}[k]\,r_X[-1 - k]. \tag{18A.2}$$

Now let $n = l + 1$ in (18A.1) so that

$$\begin{aligned}
r_X[n] &= \sum_{k=0}^{\infty} h[k]\,r_X[n - 1 - k] \qquad n = 1, 2, \ldots \\
       &= \sum_{j=1}^{\infty} h[j - 1]\,r_X[n - j] \qquad (\text{let } j = k + 1)
\end{aligned} \tag{18A.3}$$

and also let $j = k + 1$ in (18A.2) to yield

$$r_X[0] = \sum_{j=1}^{\infty} h[j - 1]\,r_X[-j] + \mathrm{mse}_{\min} \tag{18A.4}$$

where we drop the “opt” on $h_{\mathrm{opt}}[k]$ since $h[k]$ and $\mathrm{mse}_{\min}$ are unknowns that we
wish to solve for. Then combining (18A.3) and (18A.4) we have

$$r_X[n] = \sum_{j=1}^{\infty} h[j - 1]\,r_X[n - j] + \mathrm{mse}_{\min}\,\delta_{n0} \qquad n = 0, 1, \ldots$$

where $\delta_{n0} = 1$ for $n = 0$ and $\delta_{n0} = 0$ for $n \geq 1$. Next divide both sides by $\mathrm{mse}_{\min}$ to
yield

$$\frac{r_X[n]}{\mathrm{mse}_{\min}} = \sum_{j=1}^{\infty} \frac{h[j - 1]}{\mathrm{mse}_{\min}}\,r_X[n - j] + \delta_{n0} \qquad n = 0, 1, \ldots.$$

Let

$$g[j] = \begin{cases} 1/\mathrm{mse}_{\min} & j = 0 \\ -h[j - 1]/\mathrm{mse}_{\min} & j = 1, 2, \ldots \end{cases} \tag{18A.5}$$

so that the equations become

$$r_X[n]\,g[0] = -\sum_{j=1}^{\infty} g[j]\,r_X[n - j] + \delta_{n0}$$

or

$$\sum_{j=0}^{\infty} g[j]\,r_X[n - j] = \delta_{n0} \qquad n = 0, 1, \ldots. \tag{18A.6}$$


Now if (18A.6) can be solved for $g[j]$, then $h[j]$ and $\mathrm{mse}_{\min}$ can be found from
(18A.5). Note that (18A.6) is a discrete-time convolution that holds for $n \geq 0$. We
therefore need to find a causal sequence $g[n]$ (since the sum in (18A.6) is only over
$j \geq 0$), which when convolved with $r_X[n]$ yields 1 for $n = 0$ and 0 for $n > 0$. Note
that the values of $g[n] \star r_X[n]$ for $n < 0$ are unspecified by the equations. Hence,
$g[n] \star r_X[n]$ must be an anticausal sequence to be a solution of (18A.6). This can
easily be solved if

$$\mathcal{P}_X(z) = \sum_{k=-\infty}^{\infty} r_X[k]\,z^{-k}$$

can be written as

$$\mathcal{P}_X(z) = \frac{\sigma_U^2}{\mathcal{A}(z)\,\mathcal{A}(z^{-1})} \tag{18A.7}$$

where

$$\mathcal{A}(z) = 1 - \sum_{k=1}^{\infty} a[k]\,z^{-k}$$

has all its zeros within the unit circle of the z plane. Now $1/\mathcal{A}(z)$ is the z-transform
of a causal sequence. This is because if all the zeros of $\mathcal{A}(z)$ are within the unit
circle, then all the poles of $1/\mathcal{A}(z)$ are within the unit circle. Thus, the z-transform
$1/\mathcal{A}(z)$ must converge on and outside of the unit circle. Also, then $1/\mathcal{A}(z^{-1})$ is the
z-transform of an anticausal sequence. Assuming this is possible (18A.6) becomes

$$\mathcal{Z}^{-1}\{\mathcal{G}(z)\,\mathcal{P}_X(z)\} = \mathcal{Z}^{-1}\left\{\mathcal{G}(z)\,\frac{\sigma_U^2}{\mathcal{A}(z)\,\mathcal{A}(z^{-1})}\right\} = \begin{cases} 1 & n = 0 \\ 0 & n > 0 \end{cases}$$

where $\mathcal{G}(z)$ is the z-transform of $g[n]$ and $\mathcal{Z}^{-1}$ denotes the inverse z-transform. Now
if we choose

$$\mathcal{G}(z) = \frac{\mathcal{A}(z)}{\sigma_U^2} \tag{18A.8}$$

then

$$\mathcal{Z}^{-1}\left\{\mathcal{G}(z)\,\frac{\sigma_U^2}{\mathcal{A}(z)\,\mathcal{A}(z^{-1})}\right\} = \mathcal{Z}^{-1}\left\{\frac{1}{\mathcal{A}(z^{-1})}\right\} = \begin{cases} 1 & n = 0 \\ 0 & n > 0 \end{cases}$$


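since $1/\mathcal{A}(z^{-1})$ is the z-transform of an anticausal sequence, and the equations are
satisfied. The inverse z-transform for $n = 0$ has been obtained by using the initial
value theorem [Jackson 1991] which says that for an anticausal sequence $x[n]$

$$\lim_{z \to 0} \mathcal{Z}\{x[n]\} = \lim_{z \to 0} \sum_{n=-\infty}^{0} x[n]\,z^{-n} = x[0].$$

Therefore, we have that

$$\mathcal{Z}^{-1}\left\{\frac{1}{\mathcal{A}(z^{-1})}\right\}\Bigg|_{n=0} = \lim_{z \to 0} \frac{1}{\mathcal{A}(z^{-1})} = 1.$$

The solution for $g[n]$ is from (18A.8)

$$g[n] = \mathcal{Z}^{-1}\left\{\frac{\mathcal{A}(z)}{\sigma_U^2}\right\} = \begin{cases} 1/\sigma_U^2 & n = 0 \\ -a[n]/\sigma_U^2 & n \geq 1 \end{cases}$$

and using (18A.5)

$$\frac{1}{\mathrm{mse}_{\min}} = g[0] = \frac{1}{\sigma_U^2}$$
$$-\frac{h[j - 1]}{\mathrm{mse}_{\min}} = g[j] = -\frac{a[j]}{\sigma_U^2} \qquad j \geq 1.$$

Finally, we have the result that

$$h[n] = a[n + 1] \qquad n = 0, 1, \ldots$$
$$\mathrm{mse}_{\min} = \sigma_U^2.$$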
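As a quick numerical check of this result, the following sketch takes an AR(1) process, for which $\mathcal{A}(z) = 1 - az^{-1}$ (so that $a[1] = a$ and $a[k] = 0$ for $k \geq 2$), and confirms that $h[0] = a$ satisfies (18A.1) and that (18A.2) returns $\sigma_U^2$. The ACS used is the one quoted in Problem 18.26; the numerical values of `a` and `var_u` and the helper names are assumptions made only for illustration.

```python
# Check the infinite length one-step predictor for an AR(1) process.
# Illustrative (assumed) parameter values:
a = 0.8       # AR(1) filter parameter, |a| < 1
var_u = 2.0   # variance of the driving white noise U[n]


def r_x(k):
    """ACS of the AR(1) process: r_X[k] = var_u * a^|k| / (1 - a^2)."""
    return var_u * a ** abs(k) / (1.0 - a * a)


# Appendix result: h[n] = a[n + 1], so h[0] = a and h[k] = 0 for k >= 1.
h = [a]


def predictor_rhs(l):
    """Right-hand side of (18A.1): sum over k of h[k] r_X[l - k]."""
    return sum(h[k] * r_x(l - k) for k in range(len(h)))


# Largest violation of (18A.1) over l = 0, 1, ..., 19.
max_residual = max(abs(r_x(l + 1) - predictor_rhs(l)) for l in range(20))

# Minimum MSE from (18A.2): r_X[0] - sum over k of h[k] r_X[-1 - k].
mse_min = r_x(0) - sum(h[k] * r_x(-1 - k) for k in range(len(h)))

print("largest residual in (18A.1):", max_residual)
print("minimum MSE:", mse_min, "driving noise variance:", var_u)
```

The residual should be at the level of floating point rounding, and the computed minimum MSE should agree with `var_u`, in line with $\mathrm{mse}_{\min} = \sigma_U^2$.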


Chapter  19 


Multiple Wide Sense Stationary Random Processes

19.1  Introduction 

In Chapters 7 and 12 we defined multiple random variables $X$ and $Y$ as a mapping
from the sample space $\mathcal{S}$ of the experiment to a point $(x, y)$ in the x-y plane. We
now extend that definition to be a mapping from $\mathcal{S}$ to a point in the x-y plane that
evolves with time, and denote that point as $(x[n], y[n])$ for $-\infty < n < \infty$. The
mapping, denoted either by $(X[n], Y[n])$ or equivalently by $[X[n]\ Y[n]]^T$, is called a
jointly distributed random process. An example is the mapping from a point at some
geographical location, where the possible choices for the location constitute $\mathcal{S}$, to the
daily temperature and pressure at that point or $(T[n], P[n])$. Instead of treating the
random processes, which describe temperature and pressure, separately, it makes
more sense to analyze them jointly. This is especially true if the random processes
are correlated. For example, a drop in barometric pressure usually indicates the
onset of a storm, which in turn will cause a drop in the temperature. Another
example of great interest is the effect of a change in the Federal Reserve discount
rate, which is the percentage interest charged to banks by the Federal Reserve,
on the rate of job creation. It is generally assumed that by lowering the discount
rate, companies can borrow money more cheaply and so invest in new products
and services, thus increasing the demand for labor. The jointly distributed random
processes describing this situation are $I[n]$, the daily discount interest rate, and
$J[n]$, the daily number of employed Americans. Many other examples are possible,
encompassing a wide range of disciplines.

In this chapter we extend the concept of a wide sense stationary (WSS) random
process to two jointly distributed WSS random processes. The extension to
any number of WSS random processes can be found in [Bendat and Piersol 1971,
Jenkins and Watts 1968, Kay 1988, Koopmans 1974, Robinson 1967]. Multiple
random process theory is known by the synonymous terms multivariate random
processes, multichannel random processes, and vector random processes. Also, the
characterization of the random processes at the input and output of an LSI system
is explored. We will find that the extensions are much the same as in going from a
single random variable to two random variables, especially since the definitions are
based on samples of the random process, which themselves are random variables.
As in previous chapters our focus will be on discrete-time random processes but
the analogous concepts and formulas for continuous-time random processes will be
summarized later.

19.2  Summary 

Two random processes are jointly WSS if they are individually WSS (satisfy (19.1)-
(19.4)) and also the cross-correlation given by (19.5) does not depend on n. The
sequence given by (19.5) is called the cross-correlation sequence. The cross-correlation
sequence has the properties given in Properties 19.1-19.4, which differ from those of
the ACS. Jointly WSS random processes are defined to be uncorrelated if (19.12)
holds. The cross-power spectral density is defined by (19.13) and is evaluated using
(19.14). It has the properties given by Properties 19.5-19.9, which differ from those
of the PSD. The correlation between two jointly WSS random processes can be
measured in the frequency domain using the coherence function defined in (19.20). The
ACS and PSD for the sum of two jointly distributed WSS random processes are given
in Section 19.5. If the random processes are uncorrelated, then the ACS and PSD
of the sum random process are given by (19.25) and (19.26), respectively. For the
filtering  operation  shown  in  Figure  19.2a  the  cross-correlation  sequence  is  given  by 
(19.27)  and  the  cross-power  spectral  density  by  (19.28).  For  the  filtering  operation 
shown  in  Figure  19.2b  the  cross-correlation  sequence  is  given  by  (19.29)  and  the 
cross-power  spectral  density  by  (19.30).  The  corresponding  definitions  and  formulas 
for  continuous-time  random  processes  are  given  in  Section  19.6.  Estimation  of  the 
cross-correlation  sequence  is  discussed  in  Section  19.7  with  the  estimate  given  by 
(19.46).  Finally,  an  application  of  cross-correlation  to  brain  physiology  research  is 
described  in  Section  19.8. 

19.3  Jointly  Distributed  WSS  Random  Processes 

We will denote the two discrete-time random processes by $X[n]$ and $Y[n]$ for
$-\infty < n < \infty$. Of particular interest will be the extension of the concept of wide sense
stationarity from one to two random processes. To do so we first assume that each
random process is individually WSS, which is to say that

$$\mu_X[n] = E[X[n]] = \mu_X \tag{19.1}$$
$$r_X[k] = E[X[n]X[n + k]] \tag{19.2}$$
$$\mu_Y[n] = E[Y[n]] = \mu_Y \tag{19.3}$$
$$r_Y[k] = E[Y[n]Y[n + k]] \tag{19.4}$$

or the first two moments do not depend on n. For the concept of wide sense
stationarity to be useful in the context of two random processes, we require a further
definition. To motivate it, consider the situation in which we add the two random
processes together and wish to determine the overall average power. Then, if
$Z[n] = X[n] + Y[n]$, we need to find $E[Z^2[n]]$. Proceeding we have

$$\begin{aligned}
E[Z^2[n]] &= E[(X[n] + Y[n])^2] \\
          &= E[X^2[n]] + E[X[n]Y[n]] + E[Y[n]X[n]] + E[Y^2[n]] \\
          &= r_X[0] + 2E[X[n]Y[n]] + r_Y[0].
\end{aligned}$$

To complete the calculation we require knowledge of the joint moment $E[X[n]Y[n]]$.
If it does not depend on n, then $E[Z^2[n]]$ will likewise not depend on n. More
generally, if we were to compute $E[Z[n]Z[n + k]]$, then we would require knowledge
of $E[X[n]Y[n + k]]$ and so we will assume that the latter does not depend on n.
Therefore, with this assumption we can now define

$$r_{X,Y}[k] = E[X[n]Y[n + k]] \qquad k = \ldots, -1, 0, 1, \ldots. \tag{19.5}$$

This new sequence is called the cross-correlation sequence (CCS). Returning to our
average power computation we can now write that

$$E[Z^2[n]] = r_X[0] + 2r_{X,Y}[0] + r_Y[0]$$

and the average power is seen not to depend on n. Note also from the definition of
the CCS, that the ACS is just $r_{X,X}[k]$.

If $X[n]$ and $Y[n]$ are WSS random processes and a CCS can be defined
($E[X[n]Y[n + k]]$ not dependent on n), then the random processes are said to be
jointly wide sense stationary. In summary, for the two random processes to be
jointly WSS we require the conditions (19.1)-(19.5) to hold. An example follows.
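The average power formula $E[Z^2[n]] = r_X[0] + 2r_{X,Y}[0] + r_Y[0]$ can be checked on a concrete pair. The sketch below assumes, purely for illustration, that $X[n] = U[n]$ and $Y[n] = U[n] + 2U[n - 1]$ with $U[n]$ white noise of unit variance (the pair analyzed in Example 19.2), so that both processes are weighted sums of the same white noise. The coefficient lists and the helper `corr` are hypothetical names introduced here.

```python
# X[n], Y[n], and Z[n] = X[n] + Y[n] written as sums of b[i] * U[n - i],
# where U[n] is white noise with unit variance (an assumed example).
bx = [1.0, 0.0]                        # X[n] = U[n]
by = [1.0, 2.0]                        # Y[n] = U[n] + 2 U[n-1]
bz = [p + q for p, q in zip(bx, by)]   # Z[n] = 2 U[n] + 2 U[n-1]


def corr(b1, b2, k):
    """E[V[n] W[n+k]] for V[n] = sum_i b1[i] U[n-i] and W[n] = sum_i b2[i] U[n-i].

    Only terms with matching noise samples survive, i.e. b1[i] * b2[i + k].
    """
    return sum(b1[i] * b2[i + k]
               for i in range(len(b1)) if 0 <= i + k < len(b2))


# Average power of Z[n] computed directly from its own coefficients ...
power_direct = corr(bz, bz, 0)

# ... and from the ACSs and the CCS at lag zero.
power_from_ccs = corr(bx, bx, 0) + 2.0 * corr(bx, by, 0) + corr(by, by, 0)

print(power_direct, power_from_ccs)
```

Both routes should give the same average power, and neither depends on n, which is the point of requiring a well-defined CCS.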

Example 19.1 - CCS for WSS random processes delayed with respect to
each other

Let $X[n]$ be a WSS random process and let $Y[n]$ be a delayed version of $X[n]$ so
that $Y[n] = X[n - n_0]$. Then, to determine if the random processes are jointly WSS
we have

$$\begin{aligned}
E[X[n]] &= \mu_X \\
E[Y[n]] &= E[X[n - n_0]] = \mu_X \\
E[X[n]X[n + k]] &= r_X[k] \\
E[Y[n]Y[n + k]] &= E[X[n - n_0]X[n + k - n_0]] = r_X[k] \\
E[X[n]Y[n + k]] &= E[X[n]X[n + k - n_0]] = r_X[k - n_0]
\end{aligned} \tag{19.6}$$

all of which follow from our definition of $Y[n]$ and the assumption that $X[n]$ is WSS.
Note that $E[X[n]Y[n + k]]$ does not depend on n and so a CCS can be defined. It
is given by (19.6) as

$$r_{X,Y}[k] = r_X[k - n_0]. \tag{19.7}$$

Since all the first and second moments do not depend on n, the random processes
are jointly WSS.

❖ 
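A short simulation can make (19.7) tangible. The sketch below assumes that $X[n]$ is white Gaussian noise with unit variance (one simple WSS choice, so that $r_X[k] = \delta[k]$) and estimates the CCS by a time average; the delay, record length, seed, and function names are all illustrative assumptions, and the time-average estimator is used here only as a plausible stand-in for an expectation.

```python
import random

random.seed(0)

T = 4000        # number of simulated samples of X[n]
n0 = 3          # assumed delay, Y[n] = X[n - n0]
max_lag = 6     # lags -max_lag..max_lag are examined

# One realization of white Gaussian noise with unit variance.
x = [random.gauss(0.0, 1.0) for _ in range(T)]


def y(n):
    """Delayed process Y[n] = X[n - n0]."""
    return x[n - n0]


def ccs_estimate(k):
    """Time-average estimate of r_XY[k] = E[X[n] Y[n + k]]."""
    # The range keeps every index n + k - n0 inside the simulated record.
    ns = range(max_lag + n0, T - max_lag)
    return sum(x[n] * y(n + k) for n in ns) / len(ns)


lags = list(range(-max_lag, max_lag + 1))
r_hat = {k: ccs_estimate(k) for k in lags}
peak_lag = max(lags, key=lambda k: r_hat[k])

print("lag of largest estimated CCS value:", peak_lag)
print("estimate at k = n0:", r_hat[n0], " estimate at k = -n0:", r_hat[-n0])
```

The largest estimate should appear at $k = n_0$ rather than at $k = 0$, with a value near $r_X[0] = 1$, while the estimate at $k = -n_0$ should be near zero.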

We will henceforth assume that $X[n]$ and $Y[n]$ are jointly WSS unless stated
otherwise. From the previous example it is observed that the CCS has very different
properties than the ACS. Unlike the ACS, the CCS does not necessarily have its
maximum value at $k = 0$. In the previous example, the maximum of the CCS
occurs at $k = n_0$ (see (19.7)). Also, in general we do not have $r_{X,Y}[-k] = r_{X,Y}[k]$,
that is, the CCS is not symmetric about $k = 0$. In the previous example, we have from
(19.7)

$$r_{X,Y}[-k] = r_X[-k - n_0] = r_X[k + n_0] \neq r_X[k - n_0] = r_{X,Y}[k].$$

Furthermore, even though the CCS is symmetric about $k = n_0$ in the previous
example, it need not be symmetric at all.


CCS  asymmetry  requires  vigilance. 


Since the CCS is not symmetric, in contrast to the ACS, one must be careful.
The cross-second moment $E[X[m]Y[n]]$, where $X[n]$ and $Y[n]$ are jointly WSS, is
expressed in terms of the CCS as $r_{X,Y}[n - m]$, not $r_{X,Y}[m - n]$. To determine the
argument k of the CCS for $r_{X,Y}[k]$, always take the index of the Y random variable
and subtract the index of the X random variable. For example, $E[X[3]Y[1]] =
r_{X,Y}[1 - 3] = r_{X,Y}[-2]$. This is especially important in light of the fact that the
definition of the CCS is not standard. Some authors use $r_{X,Y}[k] = E[X[n]Y[n - k]]$,
which will produce a CCS that is “flipped around” in k, relative to our definition.

A 
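The index rule above is easy to get backwards, so a two-line helper may be a useful reminder. The function names below are hypothetical and simply encode the two conventions mentioned in the text.

```python
def lag_this_text(x_index, y_index):
    """Argument k of r_XY[k] = E[X[n] Y[n + k]] for the moment E[X[m] Y[n]]."""
    return y_index - x_index


def lag_alternative(x_index, y_index):
    """Argument under the other convention, r_XY[k] = E[X[n] Y[n - k]]."""
    return x_index - y_index


# The moment E[X[3] Y[1]] under each convention.
k_here = lag_this_text(3, 1)
k_other = lag_alternative(3, 1)
print(k_here, k_other)
```

When comparing results against another reference or a software routine, it may help to evaluate a known delayed pair such as that of Example 19.1 first, to see on which side of $k = 0$ the peak appears.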

We  give  one  more  example  and  then  summarize  the  properties  of  the  CCS. 

Example 19.2 - Another calculation of the CCS

Assume that $X[n] = U[n]$ and $Y[n] = U[n] + 2U[n - 1]$, where $U[n]$ is white noise
with variance $\sigma_U^2 = 1$. Thus, $X[n]$ is a white noise random process and $Y[n]$ is a
general MA random process, i.e., no Gaussian assumption is made. Then, it is easily
shown that $\mu_X[n] = \mu_Y[n] = 0$, $r_X[k] = \delta[k]$, and

$$r_Y[k] = \begin{cases} 5 & k = 0 \\ 2 & k = \pm 1 \\ 0 & \text{otherwise} \end{cases}$$

so that $X[n]$ and $Y[n]$ are individually WSS. Now computing the cross-second
moment, we have

$$\begin{aligned}
E[X[n]Y[n + k]] &= E[U[n](U[n + k] + 2U[n + k - 1])] \\
                &= r_U[k] + 2r_U[k - 1] \\
                &= \delta[k] + 2\delta[k - 1]
\end{aligned}$$

and it is seen to be independent of n. Hence, the CCS is

$$r_{X,Y}[k] = \delta[k] + 2\delta[k - 1]$$

and the random processes are jointly WSS. The ACSs and the CCS are shown in
Figure 19.1. We observe that $r_{X,Y}[-k] \neq r_{X,Y}[k]$ and that the maximum does not
occur at $k = 0$. We can assert, however, that the maximum must be less than or
equal to $\sqrt{r_X[0]\,r_Y[0]}$ since by the Cauchy-Schwarz inequality (see Appendix 7A)

$$\begin{aligned}
|r_{X,Y}[k]| &= |E[X[n]Y[n + k]]| \\
             &\leq \sqrt{E[X^2[n]]\,E[Y^2[n + k]]} \\
             &= \sqrt{r_X[0]\,r_Y[0]}.
\end{aligned}$$

For this example we see that

$$|r_{X,Y}[k]| \leq \sqrt{1 \cdot 5} = \sqrt{5}.$$

❖
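Because the example makes no Gaussian assumption, it can be illustrated with a deliberately non-Gaussian white noise. The sketch below assumes $U[n]$ takes the values $\pm 1$ with equal probability (zero mean, unit variance) and estimates the CCS by a time average; the record length, seed, and helper names are illustrative assumptions rather than anything prescribed by the text.

```python
import math
import random

random.seed(1)

N = 10000
# White noise with unit variance that is not Gaussian: independent +/-1 values.
u = [random.choice((-1.0, 1.0)) for _ in range(N)]


def y_at(n):
    """Y[n] = U[n] + 2 U[n - 1]; X[n] is simply U[n]."""
    return u[n] + 2.0 * u[n - 1]


def ccs_estimate(k, max_lag=2):
    """Time-average estimate of r_XY[k] = E[X[n] Y[n + k]] for |k| <= max_lag."""
    ns = range(max_lag + 1, N - max_lag)   # keeps all indices inside the record
    return sum(u[n] * y_at(n + k) for n in ns) / len(ns)


lags = [-2, -1, 0, 1, 2]
r_hat = {k: ccs_estimate(k) for k in lags}
bound = math.sqrt(1.0 * 5.0)   # sqrt(r_X[0] r_Y[0]) from the example

for k in lags:
    print(k, round(r_hat[k], 3))
print("Cauchy-Schwarz bound:", bound)
```

The estimates should be close to $\delta[k] + 2\delta[k - 1]$, so that the largest value sits at $k = 1$ and remains below the bound $\sqrt{5}$.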

We  now  summarize  the  properties  (or  more  appropriately  the  nonproperties)  of  the 
CCS. 

Property 19.1 - CCS is not necessarily symmetric.

$$r_{X,Y}[-k] \neq r_{X,Y}[k] \tag{19.8}$$

□

Property 19.2 - The maximum of the CCS can occur for any value of k.

□

Property 19.3 - The maximum value of the CCS is bounded.

$$|r_{X,Y}[k]| \leq \sqrt{r_X[0]\,r_Y[0]} \tag{19.9}$$

□

Figure 19.1: Autocorrelation and cross-correlation sequences for Example 19.2.

A fourth property that is useful arises by considering $E[Y[n]X[n + k]]$, which is
the cross-second moment with $X[n]$ and $Y[n]$ interchanged. Assuming jointly WSS
random processes, this moment becomes

$$\begin{aligned}
E[Y[n]X[n + k]] &= E[X[n + k]Y[n]] \\
                &= E[X[m]Y[m - k]] \qquad (\text{let } m = n + k) \\
                &= r_{X,Y}[-k] \qquad (\text{from definition of CCS}).
\end{aligned}$$

Therefore, $E[Y[n]X[n + k]]$ does not depend on n and so we can define another
cross-correlation sequence as

$$r_{Y,X}[k] = E[Y[n]X[n + k]] \tag{19.10}$$

and it is seen to be equal to $r_{X,Y}[-k]$. Thus, as our last property we have

Property 19.4 - Interchanging $X[n]$ and $Y[n]$ flips the CCS about $k = 0$.

$$r_{Y,X}[k] = r_{X,Y}[-k] \tag{19.11}$$

□
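Property 19.4 can be confirmed exactly for the processes of Example 19.2, since both are weighted sums of the same unit-variance white noise. In the sketch below the coefficient lists and the helper `cross_corr` are hypothetical names; the helper evaluates the cross-second moment in closed form rather than by simulation.

```python
bx = [1.0, 0.0]   # X[n] = U[n]
by = [1.0, 2.0]   # Y[n] = U[n] + 2 U[n-1]


def cross_corr(b_first, b_second, k):
    """E[V[n] W[n+k]] for V[n] = sum_i b_first[i] U[n-i], W[n] = sum_i b_second[i] U[n-i],
    with U[n] white noise of unit variance."""
    return sum(b_first[i] * b_second[i + k]
               for i in range(len(b_first)) if 0 <= i + k < len(b_second))


lags = range(-2, 3)
r_xy = {k: cross_corr(bx, by, k) for k in lags}   # r_XY[k] = E[X[n] Y[n+k]]
r_yx = {k: cross_corr(by, bx, k) for k in lags}   # r_YX[k] = E[Y[n] X[n+k]]

flipped = all(r_yx[k] == r_xy[-k] for k in lags)
print(r_xy)
print(r_yx)
print("r_YX[k] equals r_XY[-k] at every lag:", flipped)
```

Here the nonzero value 2 moves from lag $k = 1$ in $r_{X,Y}[k]$ to lag $k = -1$ in $r_{Y,X}[k]$, which is the flip about $k = 0$ stated in (19.11).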

Next, we define the concept of uncorrelated jointly WSS random processes. Two
zero mean jointly WSS random processes are said to be uncorrelated if

$$r_{X,Y}[k] = 0 \qquad \text{for } -\infty < k < \infty \tag{19.12}$$

including $k = 0$. (For nonzero mean random processes the definition of uncorrelated
random processes is that $r_{X,Y}[k] = \mu_X \mu_Y$ for $-\infty < k < \infty$.) Of course, if the
zero mean random processes are individually WSS and independent, so that
$E[X[n]Y[n + k]] = E[X[n]]E[Y[n + k]] = 0$ does not depend on
n, then they must be jointly WSS as well. It also follows from Property 19.4 that
if the random processes are uncorrelated, then $r_{Y,X}[k] = 0$ for all k. An example
follows.

Example 19.3 - Uncorrelated sinusoidal random processes

Let $X[n] = \cos(2\pi f_0 n + \Theta_1)$ and $Y[n] = \cos(2\pi f_0 n + \Theta_2)$, where $\Theta_1 \sim \mathcal{U}(0, 2\pi)$,
$\Theta_2 \sim \mathcal{U}(0, 2\pi)$, and $\Theta_1$ and $\Theta_2$ are independent random variables. Then, we have
seen previously that $X[n]$ and $Y[n]$ are individually WSS (see Example 17.4) and

$$\begin{aligned}
r_{X,Y}[k] &= E[X[n]Y[n + k]] \\
           &= E_{\Theta_1, \Theta_2}[\cos(2\pi f_0 n + \Theta_1)\cos(2\pi f_0 (n + k) + \Theta_2)] \\
           &= E_{\Theta_1}[\cos(2\pi f_0 n + \Theta_1)]\,E_{\Theta_2}[\cos(2\pi f_0 (n + k) + \Theta_2)] \\
           &\qquad (\text{independent random variables and (12.30)}) \\
           &= 0
\end{aligned}$$

since each random sinusoid has a zero mean (see Example 16.11). Thus, the random
processes are uncorrelated and jointly WSS. Can you interpret this result physically?

❖
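The zero CCS of this example can also be seen by Monte Carlo averaging over the random phases. The sketch below fixes arbitrary values of $f_0$, n, and k (all assumptions for illustration) and averages the product of the two sinusoids over many independent phase draws. For contrast it repeats the average with $\Theta_2 = \Theta_1$, in which case the identity $\cos\alpha\cos\beta = \frac{1}{2}[\cos(\alpha - \beta) + \cos(\alpha + \beta)]$ gives the mean $\frac{1}{2}\cos(2\pi f_0 k)$.

```python
import math
import random

random.seed(2)

f0, n, k = 0.1, 5, 3     # illustrative frequency, time index, and lag
trials = 20000           # number of independent phase draws


def product_sample(same_phase):
    """One realization of X[n] * Y[n + k] for random phases."""
    theta1 = random.uniform(0.0, 2.0 * math.pi)
    theta2 = theta1 if same_phase else random.uniform(0.0, 2.0 * math.pi)
    return (math.cos(2.0 * math.pi * f0 * n + theta1)
            * math.cos(2.0 * math.pi * f0 * (n + k) + theta2))


# Independent phases: the average should be near zero for any n and k.
r_independent = sum(product_sample(False) for _ in range(trials)) / trials

# Identical phases: the average should be near 0.5 * cos(2 pi f0 k).
r_same_phase = sum(product_sample(True) for _ in range(trials)) / trials
r_same_theory = 0.5 * math.cos(2.0 * math.pi * f0 * k)

print("independent phases:", r_independent)
print("identical phases:  ", r_same_phase, "theory:", r_same_theory)
```

With independent phases, knowing the phase of one sinusoid says nothing about the other, so on average their product cancels; with a shared phase the two processes move together and the correlation survives.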

19.4  The  Cross-Power  Spectral  Density 

The PSD of a WSS random process was seen earlier to describe the distribution of
average power with frequency. Also, the average power of the random process in a
band of frequencies is obtained by integrating the PSD over that frequency band.
In a similar vein to the definition of the PSD, we can define the cross-power spectral
density (CPSD) of two jointly WSS random processes as

$$P_{X,Y}(f) = \lim_{M \to \infty} \frac{1}{2M + 1} E\left[\left(\sum_{n=-M}^{M} X[n]\exp(-j2\pi f n)\right)^{*} \left(\sum_{n=-M}^{M} Y[n]\exp(-j2\pi f n)\right)\right] \tag{19.13}$$

which results in the usual PSD if $Y[n] = X[n]$. Using a similar derivation as the one
that resulted in the Wiener-Khinchine theorem, it can be shown that (see Problem
19.8)

$$P_{X,Y}(f) = \sum_{k=-\infty}^{\infty} r_{X,Y}[k]\exp(-j2\pi f k). \tag{19.14}$$

It is less clear than for the PSD what the physical significance of the CPSD is. From
(19.13) it appears that the CPSD will be large when the Fourier transforms of $X[n]$
and $Y[n]$ at a given frequency are large and are in phase. Conversely, when the
Fourier transforms are either small or out of phase, the CPSD will be small. This is
confirmed by the results of Example 19.3 in which the sinusoidal processes have all
their power at $f = \pm f_0$ since $P_X(f) = P_Y(f) = \frac{1}{4}\delta(f + f_0) + \frac{1}{4}\delta(f - f_0)$. However,
because they have phases that are independent of each other and can take on values
in $(0, 2\pi)$ uniformly, $r_{X,Y}[k] = 0$ and therefore, $P_{X,Y}(f) = 0$. On the other hand,
if the phase random variables were statistically dependent, say $\Theta_1 = \Theta_2$, then the
CPSD would be large (see Problem 19.9). Another example follows.

Example 19.4 - CCS for WSS random processes delayed with respect to
each other (continued)

We continue Example 19.1 in which $Y[n] = X[n - n_0]$ and $X[n]$ is WSS. We saw
that the CCS is given by $r_{X,Y}[k] = r_X[k - n_0]$. Using (19.14) the CPSD is

$$P_{X,Y}(f) = \sum_{k=-\infty}^{\infty} r_X[k - n_0]\exp(-j2\pi f k)$$

and letting $l = k - n_0$ produces

$$\begin{aligned}
P_{X,Y}(f) &= \sum_{l=-\infty}^{\infty} r_X[l]\exp[-j2\pi f(l + n_0)] \\
           &= \sum_{l=-\infty}^{\infty} r_X[l]\exp(-j2\pi f l)\exp(-j2\pi f n_0) \\
           &= P_X(f)\exp(-j2\pi f n_0).
\end{aligned}$$

It is seen that the CPSD is a complex function and that $P_{X,Y}(-f) \neq P_{X,Y}(f)$.
It does appear, however, that $P_{X,Y}(-f) = P_{X,Y}^{*}(f)$ so that it has the symmetry
properties

$$\begin{aligned}
|P_{X,Y}(-f)| &= |P_{X,Y}(f)| \\
\angle P_{X,Y}(-f) &= -\angle P_{X,Y}(f)
\end{aligned} \tag{19.15}$$

or the magnitude of the CPSD is an even function and the phase of the CPSD is an
odd function. This result is indeed true as we will prove in Property 19.6.

One way to think about the CPSD is as a correlation between the normalized Fourier
transforms of $X[n]$ and $Y[n]$ at a given frequency. From (19.13) we see that if

$$\begin{aligned}
X_{2M+1}(f) &= \frac{1}{\sqrt{2M + 1}} \sum_{n=-M}^{M} X[n]\exp(-j2\pi f n) \\
Y_{2M+1}(f) &= \frac{1}{\sqrt{2M + 1}} \sum_{n=-M}^{M} Y[n]\exp(-j2\pi f n)
\end{aligned} \tag{19.16}$$

then

$$P_{X,Y}(f) = \lim_{M \to \infty} E[X_{2M+1}^{*}(f)\,Y_{2M+1}(f)]. \tag{19.17}$$

This is a correlation between the two complex random variables $X_{2M+1}(f)$ and
$Y_{2M+1}(f)$. In fact, a normalized version of the CPSD is a complex correlation
coefficient. Indeed, from the Cauchy-Schwarz inequality for complex random variables
(see Appendix 7A for real random variables), we have that (recall that if $X = U + jV$,
then $E[X]$ is defined as $E[X] = E[U] + jE[V]$)

$$|E[X_{2M+1}^{*}(f)\,Y_{2M+1}(f)]| \leq \sqrt{E[|X_{2M+1}(f)|^2]\,E[|Y_{2M+1}(f)|^2]} \tag{19.18}$$

and therefore as $M \to \infty$, this becomes from (19.17) and (17.30)

$$|P_{X,Y}(f)| \leq \sqrt{P_X(f)\,P_Y(f)}. \tag{19.19}$$

Thus, if we normalize the CPSD to form the complex function of frequency

$$\gamma_{X,Y}(f) = \frac{P_{X,Y}(f)}{\sqrt{P_X(f)\,P_Y(f)}} \tag{19.20}$$

then we have that $|\gamma_{X,Y}(f)| \leq 1$. The complex function of frequency $\gamma_{X,Y}(f)$ is
called the coherence function and it is a complex correlation coefficient. It measures
the correlation between the Fourier transforms of two jointly WSS random processes
at a given frequency. As an example, consider the random processes of Example
19.4. Then

$$\begin{aligned}
\gamma_{X,Y}(f) &= \frac{P_{X,Y}(f)}{\sqrt{P_X(f)\,P_Y(f)}} \\
                &= \frac{P_X(f)\exp(-j2\pi f n_0)}{\sqrt{P_X(f)\,P_X(f)}} \\
                &= \exp(-j2\pi f n_0) \qquad (\text{since } P_X(f) \geq 0).
\end{aligned}$$

The magnitude of the coherence is unity for all frequencies, meaning that the Fourier
transform of $Y[n]$ at a given frequency can be perfectly predicted from the Fourier
transform of $X[n]$ at the same frequency since $Y[n] = X[n - n_0]$. It follows that
$Y_{2M+1}(f) = \exp(-j2\pi f n_0)X_{2M+1}(f)$ (apart from end effects that become negligible
as $M \to \infty$) and therefore $Y_{2M+1}(f) = \gamma_{X,Y}(f)X_{2M+1}(f)$
for all f. Furthermore, since the coherence magnitude is unity for all frequencies,
the prediction of the frequency component of $Y[n]$ is perfect for all frequencies as
well. This says finally that $Y[n]$ can be perfectly predicted from $X[n]$. To do so
just let $\hat{Y}[n] = X[n - n_0]$. In general, we will see later that if $Y[n]$ is the output of
an LSI system whose input is $X[n]$, then the coherence magnitude is always unity.
Can you interpret $Y[n] = X[n - n_0]$ as the action of an LSI system? Finally, in
contrast to perfect prediction, consider the CPSD if $X[n]$ and $Y[n]$ are zero mean
and uncorrelated. Then since $r_{X,Y}[k] = 0$, we have that $P_{X,Y}(f) = 0$ for all f, and of
course the coherence will be zero as well. We now summarize the properties of the
CPSD.

Property  19.5  -  CPSD  is  Fourier  transform  of  the  CCS. 


$$P_{X,Y}(f) = \sum_{k=-\infty}^{\infty} r_{X,Y}[k]\exp(-j2\pi fk)$$

Proof:  See  Problem  19.8. 


□ 


Property  19.6  -  CPSD  is  a  hermitian  function. 

A complex function $g(f)$ is hermitian if its real part is an even function and its imaginary part is an odd function about $f = 0$. This is equivalent to saying that $g(-f) = g^*(f)$. Thus,

$$P_{X,Y}(-f) = P^*_{X,Y}(f) \qquad (19.21)$$

(see also (19.15) which is valid for a hermitian function).

Proof: 


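$$\begin{aligned}
P_{X,Y}(f) &= \sum_{k=-\infty}^{\infty} r_{X,Y}[k]\exp(-j2\pi fk) \\
P_{X,Y}(-f) &= \sum_{k=-\infty}^{\infty} r_{X,Y}[k]\exp(j2\pi fk) \\
&= \sum_{k=-\infty}^{\infty} r_{X,Y}[k]\cos(2\pi fk) + j\sum_{k=-\infty}^{\infty} r_{X,Y}[k]\sin(2\pi fk) \\
&= \left(\sum_{k=-\infty}^{\infty} r_{X,Y}[k]\cos(2\pi fk) - j\sum_{k=-\infty}^{\infty} r_{X,Y}[k]\sin(2\pi fk)\right)^* \\
&= \left(\sum_{k=-\infty}^{\infty} r_{X,Y}[k]\exp(-j2\pi fk)\right)^* \\
&= P^*_{X,Y}(f).
\end{aligned}$$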


□

Property 19.7 - CPSD is bounded.

$$|P_{X,Y}(f)| \le \sqrt{P_X(f)P_Y(f)} \qquad (19.22)$$

Proof:  See  argument  leading  to  (19.20). 

□ 


Property 19.8 - CPSD is zero for zero mean uncorrelated random processes.

If $X[n]$ and $Y[n]$ are jointly WSS random processes that are zero mean and uncorrelated, then $P_{X,Y}(f) = 0$ for all $f$.

Proof: Since the random processes are zero mean and uncorrelated, $r_{X,Y}[k] = 0$ by definition. Hence, the CPSD is zero as well, being the Fourier transform of the CCS.

□ 


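Property 19.9 - CPSD of $(Y[n],X[n])$ is the complex conjugate of the CPSD of $(X[n],Y[n])$.

$$P_{Y,X}(f) = P^*_{X,Y}(f) \qquad (19.23)$$

Proof:

$$\begin{aligned}
P_{Y,X}(f) &= \sum_{k=-\infty}^{\infty} r_{Y,X}[k]\exp(-j2\pi fk) \\
&= \sum_{k=-\infty}^{\infty} r_{X,Y}[-k]\exp(-j2\pi fk) \qquad (\text{using (19.11)}) \\
&= \sum_{l=-\infty}^{\infty} r_{X,Y}[l]\exp(j2\pi fl) \\
&= P_{X,Y}(-f) \\
&= P^*_{X,Y}(f) \qquad (\text{using (19.21)}).
\end{aligned}$$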

□ 

We  conclude  this  section  with  one  more  example. 

Example  19.5  -  MA  Random  Process 

Let $Y[n] = X[n] - bX[n-1]$, where $X[n]$ is white Gaussian noise with variance $\sigma_X^2$. We wish to determine the CPSD between the input $X[n]$ and output $Y[n]$ random processes (assuming they are jointly WSS, which will be borne out shortly). (Are $X[n]$ and $Y[n]$ individually WSS?) To do so we first find the CCS and then take the Fourier transform of it. Proceeding we have

$$\begin{aligned}
r_{X,Y}[k] &= E[X[n]Y[n+k]] \\
&= E[X[n](X[n+k] - bX[n+k-1])] \\
&= E[X[n]X[n+k]] - bE[X[n]X[n+k-1]] \\
&= r_X[k] - br_X[k-1]
\end{aligned}$$

which does not depend on $n$ and hence $X[n]$ and $Y[n]$ are jointly WSS with the CCS

$$r_{X,Y}[k] = \sigma_X^2\delta[k] - b\sigma_X^2\delta[k-1].$$

The CPSD is found as the Fourier transform to yield

$$\begin{aligned}
P_{X,Y}(f) &= \sigma_X^2 - b\sigma_X^2\exp(-j2\pi f) \\
&= \sigma_X^2(1 - b\exp(-j2\pi f)).
\end{aligned}$$


❖ 
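As a quick numerical check of Example 19.5, the following sketch evaluates the Fourier sum of Property 19.5 for the two-term CCS and compares it with the closed-form CPSD. The values `sigma2 = 1` and `b = 0.5`, the frequency grid, and all variable names are illustrative assumptions and are not taken from the text.

```matlab
% Sketch: CPSD of Example 19.5 computed from its CCS (assumed values)
sigma2 = 1;                      % assumed white noise variance
b = 0.5;                         % assumed filter coefficient
k = -3:3;                        % lags covering the nonzero CCS samples
rxy = sigma2*(k==0) - b*sigma2*(k==1);     % CCS found in the example
f = -0.5:0.01:0.5;               % frequency grid
Pxy = zeros(size(f));
for m = 1:length(f)
    % Fourier sum of the CCS (Property 19.5)
    Pxy(m) = sum(rxy.*exp(-1j*2*pi*f(m)*k));
end
Pclosed = sigma2*(1 - b*exp(-1j*2*pi*f));  % closed form from the example
maxerr = max(abs(Pxy - Pclosed));          % difference is at rounding level
```

Plotting `abs(Pxy)` and `angle(Pxy)` versus `f` shows that the CPSD is complex in general, in contrast to a PSD.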

Note that in the previous example we can view $Y[n]$ as the output of an LSI filter with frequency response $H(f) = 1 - b\exp(-j2\pi f)$. Therefore, we have the result

$$P_{X,Y}(f) = H(f)\sigma_X^2. \qquad (19.24)$$

More generally, we will prove in the next section that if $X[n]$ is the input to an LSI system with $Y[n]$ as its corresponding output, then $X[n]$ and $Y[n]$ are jointly WSS and $P_{X,Y}(f) = H(f)P_X(f)$. As an application note, if the input to the LSI system is white noise with $\sigma_X^2 = 1$, then $P_{X,Y}(f) = H(f)$. To measure the frequency response of an unknown LSI system one can input white noise with a variance equal to one and then estimate the CCS from the input and observed output (see Section 19.7). Upon Fourier transforming that estimate one obtains an estimate of the frequency response. Lastly, since $P_{X,Y}(f) = H(f)$ for $P_X(f) = 1$, it is clear that the properties of the CPSD should mirror those of a frequency response, i.e., complex in general, hermitian, etc.
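The measurement idea just described can be sketched in MATLAB as follows. This is only an illustration under stated assumptions: the "unknown" system is taken to be the filter of Example 19.5 with `b = 0.5`, and the record length `N`, the maximum lag `M`, and the seed are arbitrary choices. The CCS is estimated with the averaging estimator of Section 19.7 and then Fourier transformed.

```matlab
% Sketch: estimating H(f) of an "unknown" LSI system from a white noise input
rng(0);                          % fix the seed so that the run is repeatable
N = 20000;                       % assumed record length
b = 0.5;                         % assumed coefficient of the unknown filter
x = randn(N,1);                  % white Gaussian noise with variance one
y = filter([1 -b], 1, x);        % observed output, Y[n] = X[n] - bX[n-1]
M = 5;                           % assumed maximum lag
k = (-M:M)';
rxy = zeros(2*M+1,1);
for m = 1:length(k)
    ka = abs(k(m));
    if k(m) >= 0
        rxy(m) = sum(x(1:N-ka).*y(1+ka:N))/(N-ka);   % zero and positive lags
    else
        rxy(m) = sum(x(1+ka:N).*y(1:N-ka))/(N-ka);   % negative lags
    end
end
f = -0.5:0.01:0.5;
Hest = zeros(size(f));
for m = 1:length(f)
    Hest(m) = sum(rxy.*exp(-1j*2*pi*f(m)*k));  % Fourier transform of CCS estimate
end
Htrue = 1 - b*exp(-1j*2*pi*f);   % frequency response used to generate y
maxerr = max(abs(Hest - Htrue)); % small but nonzero due to estimation error
```

The error shrinks as `N` grows, since each lag of the CCS estimate is an average of about `N` products.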

19.5  Transformations  of  Multiple  Random  Processes 

We now consider the effect of some transformations on jointly WSS random processes. As a simple first example, we add the two random processes together. Hence, assume $X[n]$ and $Y[n]$ are jointly WSS random processes, and $Z[n] = X[n] + Y[n]$. We next compute the first two moments. Clearly, we will have $\mu_Z[n] = \mu_X[n] + \mu_Y[n] = \mu_X + \mu_Y$ and

$$\begin{aligned}
r_Z[k] &= E[Z[n]Z[n+k]] \\
&= E[(X[n] + Y[n])(X[n+k] + Y[n+k])] \\
&= r_X[k] + r_{X,Y}[k] + r_{Y,X}[k] + r_Y[k] \qquad (\text{assumed jointly WSS})
\end{aligned}$$

and hence $Z[n]$ is a WSS random process. Its PSD is found by taking the Fourier transform of the ACS to yield

$$P_Z(f) = P_X(f) + P_{X,Y}(f) + P_{Y,X}(f) + P_Y(f).$$

If in particular $X[n]$ and $Y[n]$ are zero mean and uncorrelated, so that $r_{X,Y}[k] = 0$ and hence $r_{Y,X}[k] = r_{X,Y}[-k] = 0$ as well, we have

$$r_Z[k] = r_X[k] + r_Y[k] \qquad (19.25)$$

$$P_Z(f) = P_X(f) + P_Y(f). \qquad (19.26)$$
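The additivity of the ACS for zero mean uncorrelated processes can be checked by simulation. The sketch below assumes two independent white Gaussian processes with variances 1 and 2; these values, the record length, and the variable names are illustrative assumptions. The estimated ACS of the sum should be close to the sum of the estimated ACSs, the difference being the estimated cross terms.

```matlab
% Sketch: checking the ACS of a sum of two independent white noise processes
rng(0);                          % repeatable run
N = 50000;                       % assumed record length
x = randn(N,1);                  % zero mean, variance 1 (assumed)
y = sqrt(2)*randn(N,1);          % zero mean, variance 2 (assumed), independent of x
z = x + y;
M = 3;                           % lags 0,...,M
rx = zeros(M+1,1); ry = zeros(M+1,1); rz = zeros(M+1,1);
for k = 0:M
    rx(k+1) = sum(x(1:N-k).*x(1+k:N))/(N-k);
    ry(k+1) = sum(y(1:N-k).*y(1+k:N))/(N-k);
    rz(k+1) = sum(z(1:N-k).*z(1+k:N))/(N-k);
end
maxdiff = max(abs(rz - (rx + ry)));   % estimated cross terms, near zero
```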

Another frequently encountered transformation is that due to filtering of a WSS random process by one or two LSI filters. These transformations are shown in Figure 19.2. For the transformation shown in Figure 19.2a we already know from Chapter 18 that if $X[n]$ is WSS, then $Y[n]$ is also WSS and its mean and ACS are easily found. The question arises, however, as to whether $X[n]$ and $Y[n]$ are jointly WSS. To answer this we compute $E[X[n]Y[n+k]]$ to see if it depends on $n$. Proceeding, we have for the filtering operation shown in Figure 19.2a with $h[k]$ denoting the impulse response

$$\begin{aligned}
E[X[n]Y[n+k]] &= E\left[X[n]\sum_{l=-\infty}^{\infty} h[l]X[n+k-l]\right] \\
&= \sum_{l=-\infty}^{\infty} h[l]E[X[n]X[n+k-l]] \\
&= \sum_{l=-\infty}^{\infty} h[l]r_X[k-l]
\end{aligned}$$

and we see that it does not depend on $n$. Hence, if $X[n]$ is the input to an LSI system and $Y[n]$ is its corresponding output, then $X[n]$ and $Y[n]$ are jointly WSS.

Figure 19.2: Common filtering operations.




Also, we have for the CCS

$$r_{X,Y}[k] = \sum_{l=-\infty}^{\infty} h[l]r_X[k-l] \qquad (19.27)$$

which can be seen to be a discrete convolution or $r_{X,Y}[k] = h[k] \star r_X[k]$. As a result, by Fourier transforming the CCS we obtain the CPSD as

$$P_{X,Y}(f) = H(f)P_X(f) \qquad (19.28)$$

which agrees with our earlier result of (19.24). As previously asserted, we can also now prove that if $X[n]$ is a WSS random process that is input to an LSI system and $Y[n]$ is the output random process, then the coherence magnitude is one. This says that $Y[n]$ is perfectly predictable from $X[n]$, which upon reflection just says that to predict $Y[n]$ we need only pass $X[n]$ through the same filter! To verify the assertion about the coherence magnitude

$$\begin{aligned}
\gamma_{X,Y}(f) &= \frac{P_{X,Y}(f)}{\sqrt{P_X(f)P_Y(f)}} \\
&= \frac{H(f)P_X(f)}{\sqrt{P_X(f)|H(f)|^2P_X(f)}} \qquad (\text{using (19.28) and (18.11)}) \\
&= \frac{H(f)}{|H(f)|} \\
&= \exp(j\phi(f))
\end{aligned}$$

where $\phi(f)$ is the phase response of the LSI system or $\phi(f) = \angle H(f)$. Thus, $|\gamma_{X,Y}(f)| = 1$ (assuming $H(f) \ne 0$) and $Y[n]$ is perfectly predictable from $X[n]$ as

$$\hat{Y}[n] = \sum_{k=-\infty}^{\infty} h[k]X[n-k] \qquad \text{for all } n$$

where $h[k]$ is the impulse response of $H(f)$. Also, $X[n]$ can be perfectly predicted from $Y[n]$ as one might expect from the analogous result of the symmetry of the correlation coefficient, which is $\rho_{X,Y} = \rho_{Y,X}$ (see Problem 19.21).
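The convolution form of the CCS and the unit coherence magnitude can both be checked numerically. The sketch below assumes an input ACS $r_X[k] = a^{|k|}$ with $a = 0.8$ and the two-tap filter $h[0] = 1$, $h[1] = -b$ with $b = 0.5$; these choices and all variable names are illustrative assumptions. Because the ACS is truncated to lags $-K,\ldots,K$, only the lags of the convolution unaffected by the truncation are compared.

```matlab
% Sketch: CCS via convolution (19.27) and coherence via (19.28), assumed example
a = 0.8;                         % assumed ACS parameter, r_X[k] = a^|k|
b = 0.5;                         % assumed filter, h[0] = 1, h[1] = -b
K = 20;
k = -K:K;
rx = a.^abs(k);                  % input ACS on lags -K,...,K
h = [1 -b];                      % impulse response on lags 0,1
rxy = conv(h, rx);               % h[k] convolved with r_X[k], lags -K,...,K+1
kin = (-K+1):K;                  % lags not affected by truncating r_X
direct = a.^abs(kin) - b*a.^abs(kin-1);        % r_X[k] - b r_X[k-1]
ccs_err = max(abs(rxy(kin+K+1) - direct));

f = -0.5:0.01:0.5;
H = 1 - b*exp(-1j*2*pi*f);       % frequency response of the filter
Px = (1-a^2)./abs(1 - a*exp(-1j*2*pi*f)).^2;   % PSD for r_X[k] = a^|k|
Pxy = H.*Px;                     % CPSD from (19.28)
Py = abs(H).^2.*Px;              % output PSD
coh = Pxy./sqrt(Px.*Py);         % coherence function (19.20)
mag_err = max(abs(abs(coh) - 1));              % magnitude is one
phase_err = max(abs(angle(coh) - angle(H)));   % phase equals that of H(f)
```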

Next consider the transformation depicted in Figure 19.2b. The input random process $U[n]$ is WSS so that $X[n]$ and $Y[n]$ are individually WSS according to Theorem 18.3.1. To determine if they are jointly WSS we again compute $E[X[n]Y[n+k]]$ to see if it depends on $n$. Therefore, with $h_1[k]$, $h_2[k]$ denoting the impulse responses, and $H_1(f)$, $H_2(f)$ denoting the corresponding frequency responses

$$\begin{aligned}
E[X[n]Y[n+k]] &= E\left[\sum_{i=-\infty}^{\infty} h_1[i]U[n-i]\sum_{j=-\infty}^{\infty} h_2[j]U[n+k-j]\right] \\
&= \sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} h_1[i]h_2[j]E[U[n-i]U[n+k-j]] \\
&= \sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} h_1[i]h_2[j]r_U[k+i-j]
\end{aligned}$$

and does not depend on $n$. Hence, $X[n]$ and $Y[n]$ are jointly WSS and the CCS is

$$r_{X,Y}[k] = \sum_{i=-\infty}^{\infty} h_1[i]\underbrace{\sum_{j=-\infty}^{\infty} h_2[j]r_U[k+i-j]}_{g[k+i]}$$

where $g[n] = h_2[n] \star r_U[n]$. Continuing we have

$$\begin{aligned}
r_{X,Y}[k] &= \sum_{i=-\infty}^{\infty} h_1[i]g[k+i] \\
&= \sum_{l=-\infty}^{\infty} h_1[-l]g[k-l] \qquad (\text{let } l = -i) \\
&= h_1[-k] \star g[k]
\end{aligned}$$

so that

$$r_{X,Y}[k] = h_1[-k] \star h_2[k] \star r_U[k] \qquad (19.29)$$

(this should be reminiscent of another relationship that results if $h_1[k] = h_2[k] = h[k]$). Upon Fourier transforming both sides we have the CPSD

$$P_{X,Y}(f) = H_1^*(f)H_2(f)P_U(f). \qquad (19.30)$$
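The double convolution form of the CCS for two filters driven by a common input can be verified against the double sum from which it was derived. The sketch below assumes two short FIR filters and an input ACS $r_U[k] = a^{|k|}$; the filter taps, `a`, `K`, and the variable names are illustrative assumptions. The time-reversed sequence $h_1[-k]$ is formed with `fliplr`, and only lags unaffected by truncating the ACS are compared.

```matlab
% Sketch: CCS for two filters with a common input, convolution versus double sum
h1 = [1 0.5];                    % assumed h1[0], h1[1]
h2 = [1 -0.3 0.2];               % assumed h2[0], h2[1], h2[2]
a = 0.8;                         % assumed input ACS, r_U[k] = a^|k|
K = 10;
k = -K:K;
ru = a.^abs(k);                  % input ACS on lags -K,...,K
L1 = length(h1); L2 = length(h2);
c = conv(fliplr(h1), h2);        % h1[-k] convolved with h2[k], lags -(L1-1),...,L2-1
rxy = conv(c, ru);               % lags -K-(L1-1),...,K+(L2-1)
lags = (-K-(L1-1)):(K+(L2-1));
kin = (-K+(L2-1)):(K-(L1-1));    % lags not affected by truncating r_U
direct = zeros(size(kin));
for m = 1:length(kin)
    for p = 0:L1-1
        for q = 0:L2-1
            % term h1[p] h2[q] r_U[k+p-q] of the double sum
            direct(m) = direct(m) + h1(p+1)*h2(q+1)*a^abs(kin(m)+p-q);
        end
    end
end
idx = kin - lags(1) + 1;         % positions of the lags kin within rxy
maxerr = max(abs(rxy(idx) - direct));
```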

An interesting observation from (19.30) is that if the two filters have nonoverlapping passbands, as shown in Figure 19.3, then

$$H_1^*(f)H_2(f) = 0 \qquad -\tfrac{1}{2} \le f \le \tfrac{1}{2}$$

and $P_{X,Y}(f) = 0$. Taking the inverse Fourier transform of the CPSD produces the CCS, which is $r_{X,Y}[k] = 0$ for all $k$. Hence, for nonoverlapping passband filters as shown in Figure 19.3 the $X[n]$ and $Y[n]$ random processes are uncorrelated. (Note that because of the nonoverlapping passbands we must have $\mu_X = 0$ or $\mu_Y = 0$.) Since this holds for any filters satisfying the nonoverlapping constraint, it also holds in particular for any narrowband filters with nonoverlapping passbands. What this says is that the Fourier transform of a WSS random process $U[n]$ is uncorrelated at two different frequencies. (Actually, it is the truncated Fourier transform or $U_{2M+1}(f) = (1/\sqrt{2M+1})\sum_{n=-M}^{M} U[n]\exp(-j2\pi fn)$, which is required for existence, that is uncorrelated at different frequencies as $M \to \infty$.) This is because the Fourier transform can be thought of as resulting from filtering the random process with a narrowband filter and then determining the amplitude and phase of the resulting sinusoidal output. The spectral representation of a WSS random process is based upon this interpretation (see [Brockwell and Davis 1987] and also Problem 19.22).

Figure 19.3: Nonoverlapping passband filters.


⚠ Two random processes can be individually WSS but not jointly WSS.

All the examples thus far of individually WSS random processes have also resulted in jointly WSS random processes. To dispel the notion that this is true in general consider the following example. Let $X[n] = A$ and $Y[n] = (-1)^nA$, where $A$ is a random variable with $E[A] = 0$ and $\mathrm{var}(A) = 1$. Then, $\mu_X[n] = \mu_Y[n] = 0$ and it is easily shown that $r_X[k] = 1$ for all $k$ and $r_Y[k] = (-1)^k$ for all $k$. Therefore, $X[n]$ and $Y[n]$ are individually WSS random processes but they are not jointly WSS since

$$E[X[n]Y[n+k]] = E[A^2(-1)^{n+k}] = (-1)^n(-1)^k$$

which depends on $n$. For example, since $X[0] = Y[2] = A$ and $X[1] = -Y[3] = A$, we have that

$$\begin{aligned}
E[X[0]Y[2]] &= E[A^2] = 1 \\
E[X[1]Y[3]] &= E[A(-A)] = -1
\end{aligned}$$

so that the cross-correlation between two samples spaced two units apart depends on $n$.

⚠
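The dependence on $n$ in this counterexample can be made concrete with a small computation. The sketch below assumes that $A$ takes on the values $\pm 1$ with equal probability, which satisfies $E[A] = 0$ and $\mathrm{var}(A) = 1$; this PMF and the variable names are illustrative assumptions. The expectation is computed exactly as a weighted sum over the two values of $A$.

```matlab
% Sketch: X[n] = A, Y[n] = (-1)^n A with A = -1 or +1 equally likely (assumed PMF)
Avals = [-1 1];                  % values of A, so E[A] = 0 and var(A) = 1
pA = [0.5 0.5];                  % probabilities of the two values
k = 2;                           % spacing between the two samples
n = 0:5;
rxy = zeros(size(n));
for m = 1:length(n)
    % E[X[n]Y[n+k]] as an average over the two values of A
    rxy(m) = sum(pA.*Avals.*((-1)^(n(m)+k)*Avals));
end
% rxy alternates in sign with n, so the pair is not jointly WSS
```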
19.6 Continuous-Time Definitions and Formulas

Two continuous-time random processes $X(t)$ and $Y(t)$ for $-\infty < t < \infty$ are jointly WSS if $X(t)$ is WSS, $Y(t)$ is WSS, and we can define the cross-correlation function (CCF) as

$$r_{X,Y}(\tau) = E[X(t)Y(t+\tau)] \qquad -\infty < \tau < \infty \qquad (19.31)$$

which does not depend on $t$. Some properties (actually nonproperties) of the CCF are

Property 19.10 - CCF is not necessarily symmetric about $\tau = 0$.

$$r_{X,Y}(\tau) \ne r_{X,Y}(-\tau) \qquad (19.32)$$

□

Property 19.11 - The maximum of the CCF can occur for any value of $\tau$.

□

Property 19.12 - The maximum value of the CCF is bounded.

$$|r_{X,Y}(\tau)| \le \sqrt{r_X(0)r_Y(0)} \qquad (19.33)$$

□

Property 19.13 - Interchanging $X(t)$ and $Y(t)$ flips the CCF about $\tau = 0$.

$$r_{Y,X}(\tau) = r_{X,Y}(-\tau) \qquad (19.34)$$

□

Two zero mean jointly WSS continuous random processes are said to be uncorrelated if $r_{X,Y}(\tau) = 0$ for $-\infty < \tau < \infty$.

The CPSD for two jointly WSS random processes is defined as

$$P_{X,Y}(F) = \lim_{T\to\infty}\frac{1}{T}E\left[\left(\int_{-T/2}^{T/2} X(t)\exp(-j2\pi Ft)\,dt\right)^*\left(\int_{-T/2}^{T/2} Y(t)\exp(-j2\pi Ft)\,dt\right)\right] \qquad (19.35)$$

and is evaluated as

$$P_{X,Y}(F) = \int_{-\infty}^{\infty} r_{X,Y}(\tau)\exp(-j2\pi F\tau)\,d\tau. \qquad (19.36)$$

Some properties of the CPSD follow. The proofs are similar to those for the discrete-time case.

Property 19.14 - CPSD is a complex and hermitian function.

The hermitian property is

$$P_{X,Y}(-F) = P^*_{X,Y}(F) \qquad (19.37)$$

□

Property 19.15 - CPSD is bounded.

$$|P_{X,Y}(F)| \le \sqrt{P_X(F)P_Y(F)} \qquad (19.38)$$

□

Property 19.16 - CPSD of $(Y(t),X(t))$ is the complex conjugate of the CPSD of $(X(t),Y(t))$.

$$P_{Y,X}(F) = P^*_{X,Y}(F) \qquad (19.39)$$

□

The formulas for the linear system configuration corresponding to that shown in Figure 19.2a are (continuous-time system is assumed to be LTI with impulse response $h(\tau)$ and frequency response $H(F)$)

$$r_{X,Y}(\tau) = h(\tau) \star r_X(\tau) \qquad (19.40)$$

$$P_{X,Y}(F) = H(F)P_X(F) \qquad (19.41)$$

and for the configuration of Figure 19.2b (continuous-time systems are assumed to be LTI with impulse responses $h_1(\tau)$, $h_2(\tau)$, and corresponding frequency responses $H_1(F)$, $H_2(F)$)

$$r_{X,Y}(\tau) = h_1(-\tau) \star h_2(\tau) \star r_U(\tau) \qquad (19.42)$$

$$P_{X,Y}(F) = H_1^*(F)H_2(F)P_U(F). \qquad (19.43)$$

An  example  of  great  practical  importance  is  given  next  to  illustrate  the  concepts 
and  formulas. 


Example  19.6  —  Measurement  of  Channel  Delay 

It is frequently of interest to be able to measure the propagation time of a signal through a channel. This allows one to determine distance if the speed of propagation is known. This idea forms the basis for the global positioning system (GPS) [Hofmann-Wellenhof, Lichtenegger, Collins 1992]. See also Problem 19.28 for another application. To do so we transmit a WSS random process $X(t)$ that is bandlimited to $W$ Hz (meaning that $P_X(F) = 0$ for $|F| > W$) through a channel and observe the output of the channel $Y(t)$. We furthermore assume that the channel is modeled as an LTI system with frequency response

$$H(F) = \frac{\exp(-j2\pi Ft_0)}{1 + j2\pi F}. \qquad (19.44)$$

Note that the numerator term represents a delay of $t_0$ seconds, sometimes called the propagation or bulk delay, and the term $H_{LP}(F) = 1/(1 + j2\pi F)$ represents a lowpass filter response since $H_{LP}(0) = 1$ and $H_{LP}(F) \to 0$ as $F \to \infty$. A question arises as to how to choose the transmit random process $X(t)$ so that we can accurately measure the delay $t_0$ through the channel. In the ideal case in which $Y(t)$ is just a delayed replica of $X(t)$ or $Y(t) = X(t - t_0)$, we know that the CCF is

$$\begin{aligned}
r_{X,Y}(\tau) &= E[X(t)Y(t+\tau)] \\
&= E[X(t)X(t+\tau-t_0)] \\
&= r_X(\tau - t_0).
\end{aligned}$$

Since the ACF has a maximum at lag zero, there will be a maximum of $r_{X,Y}(\tau)$ at $\tau = t_0$, suggesting that the location of this maximum can be used to measure the delay. But when the channel has the frequency response given by (19.44) the maximum of the CCF may no longer be located at $\tau = t_0$. To see why, first compute the CCF as


$$\begin{aligned}
r_{X,Y}(\tau) &= \int_{-\infty}^{\infty} P_{X,Y}(F)\exp(j2\pi F\tau)\,dF \qquad (\text{inverse Fourier transform}) \\
&= \int_{-\infty}^{\infty} H(F)P_X(F)\exp(j2\pi F\tau)\,dF \qquad (\text{from (19.41)}) \\
&= \int_{-\infty}^{\infty} \frac{\exp(-j2\pi Ft_0)}{1 + j2\pi F}P_X(F)\exp(j2\pi F\tau)\,dF \qquad (\text{from (19.44)}) \\
&= \int_{-\infty}^{\infty} \frac{1}{1 + j2\pi F}P_X(F)\exp(j2\pi F(\tau - t_0))\,dF
\end{aligned}$$

and since $X(t)$ is assumed to be bandlimited to $W$ Hz, we have

$$r_{X,Y}(\tau) = \int_{-W}^{W} \frac{1}{1 + j2\pi F}P_X(F)\exp(j2\pi F(\tau - t_0))\,dF. \qquad (19.45)$$

If, as an example, we choose $X(t)$ to be bandlimited white noise (see Example 17.11) or $P_X(F) = N_0/2$ for $|F| \le W$ and $P_X(F) = 0$ for $|F| > W$, then

$$r_{X,Y}(\tau) = \frac{N_0}{2}\int_{-W}^{W} \frac{1}{1 + j2\pi F}\exp(j2\pi F(\tau - t_0))\,dF.$$

To evaluate this we first note that


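$$\mathcal{F}^{-1}\left\{\frac{1}{1 + j2\pi F}\right\} = \int_{-\infty}^{\infty} \frac{1}{1 + j2\pi F}\exp(j2\pi Ft)\,dF = \exp(-t)u(t)$$

so that if we define the frequency window function

$$G(F) = \begin{cases} 1 & |F| \le W \\ 0 & |F| > W \end{cases}$$

then

$$\begin{aligned}
r_{X,Y}(\tau) &= \frac{N_0}{2}\int_{-\infty}^{\infty} G(F)\frac{1}{1 + j2\pi F}\exp(j2\pi F(\tau - t_0))\,dF \\
&= \frac{N_0}{2}\, g(t) \star \exp(-t)u(t)\Big|_{t=\tau-t_0} \qquad (\text{convolution in time yields multiplication in frequency})
\end{aligned}$$

where $g(t)$ is the inverse Fourier transform of $G(F)$. We have chosen to express the integral in the time domain since its physical significance becomes clearer. In particular, note that the convolution in time results in a wider pulse. But

$$g(t) = 2W\frac{\sin(2\pi Wt)}{2\pi Wt} \qquad (\text{see Example 17.11})$$

so that using a convolution integral, we have

$$\begin{aligned}
r_{X,Y}(\tau) &= N_0W\int_{-\infty}^{\infty} \frac{\sin(2\pi W(\tau - t_0 - \xi))}{2\pi W(\tau - t_0 - \xi)}\exp(-\xi)u(\xi)\,d\xi \\
&= N_0W\int_{0}^{\infty} \frac{\sin(2\pi W(\tau - t_0 - \xi))}{2\pi W(\tau - t_0 - \xi)}\exp(-\xi)\,d\xi.
\end{aligned}$$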

This is shown in Figure 19.4 for the case when $W = 1$ and $t_0 = 2$ as the light line and has been normalized to have a maximum value of 1. The integral has been evaluated numerically. Note that the maximum does not occur at $\tau = t_0 = 2$ because the phase response of the channel has added a time delay. To remedy this problem we can insert an equalizing filter at the channel output whose frequency response is

$$H_{eq}(F) = \begin{cases} 1 + j2\pi F & |F| \le W \\ 0 & |F| > W. \end{cases}$$

Then, we have for the CPSD between the input $X(t)$ and output random process of the equalizer $Y(t)$

$$P_{X,Y}(F) = H_{eq}(F)H(F)P_X(F) = \begin{cases} \frac{N_0}{2}\exp(-j2\pi Ft_0) & |F| \le W \\ 0 & |F| > W. \end{cases}$$

The CCF is found as before by using the inverse Fourier transform

$$\begin{aligned}
r_{X,Y}(\tau) &= \int_{-\infty}^{\infty} P_{X,Y}(F)\exp(j2\pi F\tau)\,dF \\
&= \int_{-W}^{W} \frac{N_0}{2}\exp(j2\pi F(\tau - t_0))\,dF \\
&= N_0W\frac{\sin(2\pi W(\tau - t_0))}{2\pi W(\tau - t_0)}
\end{aligned}$$

which is shown in Figure 19.4 as the dark line. As before it has been normalized to yield one at its peak. Note that the maximum now occurs at the correct location and also the width of the maximum peak is narrower. This allows a better location of the maximum in the presence of noise.

Figure 19.4: Cross-correlation functions for $W = 1$ and $t_0 = 2$. Both curves are normalized to yield one at their peak. The light line is for no equalization while the dark line incorporates equalization. The dashed line indicates $\tau = 2$, the true delay.

❖
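The two cross-correlation functions of this example can be evaluated numerically, as was done for Figure 19.4. The sketch below integrates over the band $|F| \le W$ with `trapz` for $W = 1$ and $t_0 = 2$, with and without the equalizer. The value `N0 = 2`, the grids, and the variable names are assumptions made for illustration; `N0` only scales the curves.

```matlab
% Sketch: CCFs of Example 19.6 by numerical integration (W = 1, t0 = 2)
W = 1; t0 = 2;
N0 = 2;                          % assumed; only scales the curves
F = linspace(-W, W, 4001);       % frequency grid over the band
tau = 0:0.01:5;                  % lags at which the CCF is evaluated
r_noeq = zeros(size(tau));
r_eq = zeros(size(tau));
for m = 1:length(tau)
    e = exp(1j*2*pi*F*(tau(m) - t0));
    % no equalizer: integrand of (19.45) with P_X(F) = N0/2
    r_noeq(m) = (N0/2)*real(trapz(F, e./(1 + 1j*2*pi*F)));
    % with equalizer: the lowpass term has been cancelled
    r_eq(m) = (N0/2)*real(trapz(F, e));
end
[~, i1] = max(r_noeq);
[~, i2] = max(r_eq);
tau_noeq = tau(i1);              % peak location without equalization
tau_eq = tau(i2);                % peak location with equalization
```

Plotting `r_noeq/max(r_noeq)` and `r_eq/max(r_eq)` versus `tau` gives curves of the kind described for Figure 19.4, with the unequalized peak displaced to the right of the true delay.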


19.7  Cross-correlation  Sequence  Estimation 

The estimation of the CCS is similar to that for the ACS (see Section 17.7). The main difference between the two stems from the fact that the ACS is guaranteed to have a maximum at $k = 0$ while the maximum for the CCS can be located anywhere. Furthermore, two samples of a WSS random process tend to become less correlated as the spacing between them increases. This implies that it is only necessary to estimate the ACS $r_X[k]$ for $k = 0,1,\ldots,M$ if we assume that $r_X[k] \approx 0$ for $k > M$. For the CCS, however, we must estimate $r_{X,Y}[k]$ for $k = -M_1,\ldots,0,\ldots,M_2$ (recall that $r_{X,Y}[-k] \ne r_{X,Y}[k]$), where $M_1$ and $M_2$ are such that $r_{X,Y}[k] \approx 0$ if $k < -M_1$ or $k > M_2$. In practice, it is not clear how $M_1$ and $M_2$ should be chosen. Frequently, a preliminary estimate of $r_{X,Y}[k]$ is made, followed by a search for the maximum location. Then, the data records used to estimate the CCS are shifted relative to each other to place the maximum at $k = 0$. This is called time alignment [Jenkins and Watts 1968]. We assume that this has already been done. Then, we estimate the CCS for $|k| \le M$ assuming that we have observed the realizations for $X[n]$ and $Y[n]$, both for $n = 0,1,\ldots,N-1$. The estimate becomes


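$$\hat{r}_{X,Y}[k] = \begin{cases} \dfrac{1}{N-k}\displaystyle\sum_{n=0}^{N-1-k} x[n]y[n+k] & k = 0,1,\ldots,M \\[2ex] \dfrac{1}{N-|k|}\displaystyle\sum_{n=|k|}^{N-1} x[n]y[n+k] & k = -M,-(M-1),\ldots,-1. \end{cases} \qquad (19.46)$$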


Note that the summation limits have been chosen to make sure that all the available products $x[n]y[n+k]$ are used. Similar to the estimation of the ACS, there will be a different number of products for each $k$. For example, if $N = 4$ so that $\{x[0],x[1],x[2],x[3]\}$ and $\{y[0],y[1],y[2],y[3]\}$ are observed, and we wish to compute the CCS estimate for $|k| \le M = 2$, we will have

$$\begin{aligned}
\hat{r}_{X,Y}[-2] &= \frac{1}{2}\sum_{n=2}^{3} x[n]y[n-2] = \frac{1}{2}(x[2]y[0] + x[3]y[1]) \\
\hat{r}_{X,Y}[-1] &= \frac{1}{3}\sum_{n=1}^{3} x[n]y[n-1] = \frac{1}{3}(x[1]y[0] + x[2]y[1] + x[3]y[2]) \\
\hat{r}_{X,Y}[0] &= \frac{1}{4}\sum_{n=0}^{3} x[n]y[n] = \frac{1}{4}(x[0]y[0] + x[1]y[1] + x[2]y[2] + x[3]y[3]) \\
\hat{r}_{X,Y}[1] &= \frac{1}{3}\sum_{n=0}^{2} x[n]y[n+1] = \frac{1}{3}(x[0]y[1] + x[1]y[2] + x[2]y[3]) \\
\hat{r}_{X,Y}[2] &= \frac{1}{2}\sum_{n=0}^{1} x[n]y[n+2] = \frac{1}{2}(x[0]y[2] + x[1]y[3]).
\end{aligned}$$
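The $N = 4$ case can be reproduced with a few lines of MATLAB. The data values below are made up purely for illustration, and the variable names are assumptions; MATLAB indices start at one, so `x(1)` holds $x[0]$.

```matlab
% Sketch: the CCS estimate (19.46) for N = 4, M = 2 with made-up data
x = [1 2 3 4]';                  % x[0],...,x[3] (illustrative values)
y = [5 6 7 8]';                  % y[0],...,y[3] (illustrative values)
N = length(x); M = 2;
k = (-M:M)';                     % lags -2,...,2
rxy = zeros(2*M+1,1);
for m = 1:length(k)
    ka = abs(k(m));
    if k(m) >= 0
        rxy(m) = sum(x(1:N-ka).*y(1+ka:N))/(N-ka);   % n = 0,...,N-1-k
    else
        rxy(m) = sum(x(1+ka:N).*y(1:N-ka))/(N-ka);   % n = |k|,...,N-1
    end
end
% rxy holds the estimates for k = -2,-1,0,1,2 in that order
```

For these data the five estimates are 39/2, 56/3, 70/4, 44/3, and 23/2, which can be confirmed by hand from the expressions above.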


As an example, consider the jointly WSS random processes described in Example 19.2, where $X[n] = U[n]$, $Y[n] = U[n] + 2U[n-1]$ and $U[n]$ is white noise with variance $\sigma_U^2 = 1$. We further assume that $U[n]$ has a Gaussian PDF for each $n$ for the purpose of computer simulation (although we could use any PDF or PMF). Recall that the theoretical CCS is $r_{X,Y}[k] = \delta[k] + 2\delta[k-1]$. The estimated CCS using $N = 1000$ data samples is shown in Figure 19.5. The MATLAB code used to estimate the CCS is given below.


% assume realizations are x[n], y[n] for n=1,2,...,N (column vectors)
for k=0:M % compute zero and positive lags, see (19.46)
    % compute values for k=0,1,...,M
    rxypos(k+1,1)=(1/(N-k))*sum(x(1:N-k).*y(1+k:N));
end
for k=1:M % compute negative lags, see (19.46)
    % compute values for k=-M,-(M-1),...,-1
    rxyneg(k+1,1)=(1/(N-k))*sum(x(k+1:N).*y(1:N-k));
end
rxy=[flipud(rxyneg(2:M+1,1));rxypos]; % arrange values from k=-M to k=M

Figure 19.5: Estimated CCS using realizations for $X[n]$ and $Y[n]$ with $N = 1000$ samples. The theoretical CCS is $r_{X,Y}[k] = \delta[k] + 2\delta[k-1]$.


Finally,  we  note  that  estimation  of  the  CPSD  is  more  difficult  and  so  we  refer  the 
interested  reader  to  [Jenkins  and  Watts  1968,  Kay  1988]. 

19.8  Real-World  Example  -  Brain  Physiology  Research 

Understanding  the  operation  of  the  human  brain  is  one  of  the  most  important  goals 
of  physiological  research.  Currently,  there  is  an  enormous  effort  to  decipher  its 
inner workings. At a very fundamental level is the study of its cells or neurons,
which  when  working  in  unison  form  the  basis  for  our  behavior.  Their  electrical 
activity  and  the  transmission  of  that  activity  to  neighboring  neurons  yields  clues 
as  to  the  brain’s  operation.  When  an  individual  neuron  “fires”  it  produces  a  spike 
or  electrical  pulse  that  propagates  to  nearby  neurons.  The  connections  between 
the  neurons  that  allow  this  propagation  to  occur  are  called  synapses  and  it  is  this 
connectivity  that  is  the  focus  of  much  research.  A  typical  spike  train  that  might 
be  recorded  is  shown  in  Figure  19.6a  for  a  neuron  at  rest  and  in  Figure  19.6b  for  a 
neuron  that  has  been  excited  by  some  stimulus.  Clearly,  the  firing  rate  increases  in 
response to a stimulus. The model used to produce this figure is an IID Bernoulli random process with $p = p_q = 0.1$ for Figure 19.6a and $p = p_s = 0.6$ for Figure 19.6b. The subscripts “q” and “s” are meant to indicate the state of the neuron, either quiescent or stimulated. Now consider the question of whether two neurons are connected via a synapse. If they are, and a stimulus is applied to the first neuron, then the electrical pulse will propagate to the second neuron and appear some time later. Then, we would expect the second neuron electrical activity to change from that in Figure 19.6a to that in Figure 19.6b.

Figure 19.6: Typical spike trains for neurons. (a) Quiescent, $p_q = 0.1$. (b) Stimulated, $p_s = 0.6$.

It would be fairly simple then to estimate the $p$ for each possible connected neuron and choose the neuron or neurons (there may be multiple connections with the stimulated neuron) for which $p$ is large. Unfortunately, it is not easy to stimulate a single neuron, so when a stimulus is applied many neurons may be activated. Thus, we need a method to associate one stimulated neuron with its connected ones. Ideally, if we record the electrical activity at two neurons under consideration, denoted by $X_1[n]$ and $X_2[n]$, then for connected neurons $X_2[n] = X_1[n - n_0]$. Since we have assumed that the spike train for the first neuron $X_1[n]$ is an IID random process, it is therefore WSS and we know from Example 19.1 that the two random processes are jointly WSS. Therefore, we have as before

$$\begin{aligned}
r_{X_1,X_2}[k] &= E[X_1[n]X_2[n+k]] \\
&= E[X_1[n]X_1[n - n_0 + k]] \\
&= r_{X_1}[k - n_0]
\end{aligned}$$

and therefore the CCS will exhibit a maximum at $k = n_0$. Otherwise, if the neurons are not connected, we would expect a much smaller value of the maximum or no discernible maximum at all. For example, for unconnected but simultaneously stimulated neurons it is reasonable to assume that $X_1[n]$ and $X_2[n]$ are uncorrelated and hence $r_{X_1,X_2}[k] = E[X_1[n]]E[X_2[n+k]] = p_s^2$, which presumably will be less than $r_{X_1}[k - n_0]$ at its peak. Note that for connected neurons

$$r_{X_1,X_2}[k] = \mathrm{cov}(X_1[n],X_2[n+k]) + E[X_1[n]]E[X_2[n+k]] > E[X_1[n]]E[X_2[n+k]]$$

if the covariance is positive.

Specifically, we assume that a neuron output is modeled as an IID Bernoulli random process that takes on the values 1 and 0 with probabilities $p_s$ and $1 - p_s$, respectively. For two neurons that are connected we have that $r_{X_1,X_2}[k] = r_{X_1}[k - n_0]$. But

$$\begin{aligned}
r_{X_1}[k] &= E[X_1[n]X_1[n+k]] \\
&= \begin{cases} E[X_1^2[n]] & k = 0 \\ E[X_1[n]]E[X_1[n+k]] & k \ne 0 \end{cases} \\
&= \begin{cases} p_s & k = 0 \\ p_s^2 & k \ne 0 \end{cases} \\
&= p_s(1 - p_s)\delta[k] + p_s^2.
\end{aligned}$$

Hence, for two connected neurons the CCS is

$$r_{X_1,X_2}[k] = p_s(1 - p_s)\delta[k - n_0] + p_s^2.$$

For two neurons that are not connected, so that their outputs are uncorrelated (even if both are stimulated), the CCS is

$$\begin{aligned}
r_{X_1,X_2}[k] &= E[X_1[n]X_2[n+k]] \\
&= E[X_1[n]]E[X_2[n+k]] \\
&= p_s^2 \qquad \text{for all } k.
\end{aligned}$$

As a result, the maximum is $p_s^2$ for unconnected neurons but $p_s(1 - p_s) + p_s^2 = p_s > p_s^2$ for connected neurons. The two different CCSs are shown in Figure 19.7 for $p_s = 0.6$ and $n_0 = 2$.

Figure 19.7: CCS for unconnected and connected stimulated neurons with $p_s = 0.6$. (a) Unconnected. (b) Connected with $n_0 = 2$.

As an example, for $p_s = 0.6$ we show realizations of three neuron outputs in Figure 19.8, where only neuron 1 and neuron 3 are connected. There is a two sample delay between neurons 1 and 3. Neuron 2 is not connected to either of the other neurons and hence its spike train is uncorrelated with the others. The theoretical CCS between neurons 1 and 2 is given in Figure 19.7a while that between neurons 1 and 3 is given in Figure 19.7b. The estimated CCS for the spike trains shown in Figure 19.8 and based on the estimate of (19.46) is shown in Figure 19.9.

Figure 19.8: Spike trains for three neurons with neuron 1 connected to neuron 3 with a delay of two samples. The spike train of neuron 2 is uncorrelated with those for neurons 1 and 3. (a) Neuron 1. (b) Neuron 2. (c) Neuron 3.

Figure 19.9: Estimated CCS for unconnected and connected stimulated neurons with $p_s = 0.6$. (a) Unconnected neurons 1 and 2. (b) Connected neurons 1 and 3 with $n_0 = 2$.

It is seen that as expected there is a maximum at $k = n_0 = 2$. The interested reader should consult [Univ. Pennsylvannia 2005] for further details.
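A simulation along the lines of this example can be sketched as follows. It assumes the IID Bernoulli model with $p_s = 0.6$ and a delay of $n_0 = 2$ as in the figures; the record length `N`, the maximum lag `M`, the seed, the variable names, and the way the first `n0` samples of neuron 3 are filled in (independent draws, since the earlier samples of neuron 1 are not observed) are assumptions made for illustration.

```matlab
% Sketch: detecting a connection between simulated spike trains (assumed setup)
rng(0);                          % repeatable run
N = 2000;                        % assumed number of samples
ps = 0.6; n0 = 2;                % values used in the figures
x1 = double(rand(N,1) < ps);     % neuron 1, IID Bernoulli(ps)
x2 = double(rand(N,1) < ps);     % neuron 2, independent of neuron 1
x3 = [double(rand(n0,1) < ps); x1(1:N-n0)];   % neuron 3, X3[n] = X1[n-n0]
M = 5; k = (-M:M)';
r12 = zeros(2*M+1,1); r13 = zeros(2*M+1,1);
for m = 1:length(k)
    ka = abs(k(m));
    if k(m) >= 0                 % zero and positive lags of (19.46)
        r12(m) = sum(x1(1:N-ka).*x2(1+ka:N))/(N-ka);
        r13(m) = sum(x1(1:N-ka).*x3(1+ka:N))/(N-ka);
    else                         % negative lags of (19.46)
        r12(m) = sum(x1(1+ka:N).*x2(1:N-ka))/(N-ka);
        r13(m) = sum(x1(1+ka:N).*x3(1:N-ka))/(N-ka);
    end
end
[peak13, i13] = max(r13);
kpeak = k(i13);                  % lag of the largest estimated CCS value
```

For the connected pair the estimate is near $p_s$ at the delay lag and near $p_s^2$ elsewhere, while for the unconnected pair it stays near $p_s^2$ at every lag.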
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Problems 

19.1  (o)  (w)  Two  discrete-time  random  processes  are  defined  as  X[n]  =  U[n]  and 
Y  [n]  =  ( —1 ) "  U  [n]  for  —  oo  <  n  <  oo,  where  U[n]  is  white  noise  with  variance 
afj.  Are  the  random  processes  X[n]  and  Y[n]  jointly  WSS? 

19.2  (w)  Two  discrete-time  random  processes  are  defined  as  X[n\  =  ciy U\ [n]  + 
a-) U2 [n]  and  Y[n]  =  b\ U\  [n]  +  62^2 [n]  for  —  00  <  n  <  00,  where  U\[n]  and  U-> [n] 
are  jointly  WSS  and  ai,  02,  i>i ,  62  are  constants.  Are  the  random  processes  X\n] 
and  Y[n]  jointly  WSS? 

19.3  (f)  If  the  CCS  is  given  as  rx,y[k]  =  (l/2)lfc-1l  for  —00  <  k  <  00,  plot  it  and 
describe  which  properties  are  the  same  or  different  from  an  ACS. 

19.4  (f)  If  Y[n ]  =  X [n]  +  W[n],  where  X[n]  and  W [n]  are  jointly  WSS,  find  r_\.Y [k] 

and  Px,yU)- 

19.5  (o)  (w)  A  discrete-time  random  process  is  defined  as  Y[n]  =  X[n]W[n], 
where  X[n]  is  WSS  and  W[n]  is  an  IID  Bernoulli  random  process  that  takes  on 
values  ±1  with  equal  probability.  The  random  processes  A[n]  and  W[n]  are 
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independent  of  each  other,  which  means  that  X[n\]  is  independent  of  W[ri2] 
for  all  n\  and  n 2.  Find  rx,y[fc]  and  explain  your  results. 

19.6 (^) (w) In this problem we show that for the AR random process $X[n] =
aX[n-1] + U[n]$, which was described in Example 17.5, the cross-correlation
sequence $E[X[n]U[n+k]] = 0$ for $k > 0$. Do so by evaluating $E[X[n](X[n+k] -
aX[n+k-1])]$. Determine and plot the CCS $r_{X,U}[k]$ for $-\infty < k < \infty$ if
$a = 0.5$ and $\sigma_U^2 = 1$. Hint: Refer back to Example 17.5 for the ACS of an AR
random process.

19.7 (f) If $X[n]$ and $Y[n]$ are jointly WSS with ACSs


rx[k]  =  4^-J 

$r_Y[k] = 3\delta[k] + 2\delta[k+1] + 2\delta[k-1]$
determine the maximum possible value of $r_{X,Y}[k]$.

19.8 (t) Derive (19.14). To do so use the relationship

$$\sum_{m=-M}^{M} \sum_{n=-M}^{M} g[m-n] = \sum_{k=-2M}^{2M} (2M + 1 - |k|)\, g[k].$$

19.9 (f) For the two sinusoidal random processes $X[n] = \cos(2\pi f_0 n + \Theta_1)$ and
$Y[n] = \cos(2\pi f_0 n + \Theta_2)$, where $\Theta_1 = \Theta_2 \sim \mathcal{U}(0, 2\pi)$, find the CPSD and
explain your results versus the case when $\Theta_1$ and $\Theta_2$ are independent random
variables.

19.10 (o) (f,c) If $r_{X,Y}[k] = \delta[k] + 2\delta[k-1]$, plot the magnitude and phase of the
CPSD. You will need a computer to do this.

19.11 (f) For the random processes $X[n] = U[n]$ and $Y[n] = U[n] - bU[n-1]$,
where $U[n]$ is discrete white noise with variance $\sigma_U^2 = 1$, find the CPSD and
explain what happens as $b \to 0$.

19.12 (^) (w) If a random process is defined as $Z[n] = X[n] - Y[n]$, where $X[n]$
and $Y[n]$ are jointly WSS, determine the ACS and PSD of $Z[n]$.

19.13 (w) For the random processes $X[n]$ and $Y[n]$ defined in Problem 19.11 find
the coherence function. Explain what happens as $b \to 0$.

19.14 (f) Determine the CPSD for two jointly WSS random processes if $r_{X,Y}[k] =
\delta[k] - \delta[k-1]$. Also, explain why the coherence function at $f = 0$ is zero.
Hint: The random processes $X[n]$ and $Y[n]$ are those given in Problem 19.11
if $b = 1$.

19.15 (o) (f) If $Y[n] = -X[n]$ for $-\infty < n < \infty$, determine the coherence function
and relate it to the predictability of $Y[n_0]$ based on observing $X[n]$ for
$-\infty < n < \infty$.

19.16 (t) A cross-spectral matrix is defined as

$$\begin{bmatrix} P_X(f) & P_{X,Y}(f) \\ P_{Y,X}(f) & P_Y(f) \end{bmatrix}.$$

Prove that the cross-spectral matrix is positive semidefinite for all $f$. Hint:
Show that the principal minors of the matrix are all nonnegative (see Appendix
C for the definition of principal minors). To do so use the properties of the
coherence function.

19.17 (w) The random processes $X[n]$ and $Y[n]$ are zero mean jointly WSS and
are uncorrelated with each other. If $r_X[k] = 2\delta[k]$ and $r_Y[k] = (1/2)^{|k|}$ for
$-\infty < k < \infty$, find the PSD of $X[n] + Y[n]$.

19.18 (^) (t) In this problem we derive an extension of the Wiener smoother (see
Section 18.5.1). We consider the problem of estimating $Y[n_0]$ based on observing
$X[n]$ for $-\infty < n < \infty$. To do so we use the linear estimator

$$\hat{Y}[n_0] = \sum_{k=-\infty}^{\infty} h[k] X[n_0 - k].$$

To find the optimal impulse response we employ the orthogonality principle
to yield the infinite set of simultaneous linear equations

$$E\left[\left(Y[n_0] - \sum_{k=-\infty}^{\infty} h[k] X[n_0 - k]\right) X[n_0 - l]\right] = 0 \qquad -\infty < l < \infty.$$

Assuming that $X[n]$ and $Y[n]$ are jointly WSS random processes, determine
the frequency response of the optimal Wiener estimator. Then, show how the
Wiener smoother, where $Y[n]$ represents the signal $S[n]$ and $X[n]$ represents
the signal $S[n]$ plus noise $W[n]$ (recall that $S[n]$ and $W[n]$ are zero mean and
uncorrelated random processes), arises as a special case of this solution.

19.19 (f) For the random processes defined in Example 19.2 determine the CPSD.
Next, find the optimal Wiener smoother for $Y[n_0]$ based on the realization of
$X[n]$ for $-\infty < n < \infty$.

19.20 (t) Prove that if $X[n]$ is a WSS random process that is input to an LSI system
and $Y[n]$ is the corresponding random process output, then the coherence
function between the input and output has a magnitude of one.

19.21 (t) Consider a WSS random process $X[n]$ that is input to an LSI system with
frequency response $H(f)$, where $H(f) \neq 0$ for $|f| \le 1/2$, and let $Y[n]$ be the
corresponding random process output. It is desired to predict $X[n_0]$ based on
observing $Y[n]$ for $-\infty < n < \infty$. Draw a linear filtering diagram (similar to
that shown in Figure 19.2) to explain why $X[n_0]$ is perfectly predictable by
passing $Y[n]$ through a filter with frequency response $1/H(f)$.

19.22 (t) In this problem we argue that a Fourier transform is actually a narrowband
filtering operation. First consider the Fourier transform at $f = f_0$ for
the truncated random process $X[n]$, $n = -M, \ldots, 0, \ldots, M$, which is

$$\hat{X}(f_0) = \sum_{k=-M}^{M} X[k] \exp(-j2\pi f_0 k).$$

Next show that this may be written as

$$\hat{X}(f_0) = \left. \sum_{k=-\infty}^{\infty} X[k]\, h[n-k] \right|_{n=0}$$

where

$$h[k] = \begin{cases} \exp(j2\pi f_0 k) & k = -M, \ldots, 0, \ldots, M \\ 0 & |k| > M. \end{cases}$$

Notice that this is a convolution sum so that $h[k]$ can be considered as the
impulse response, although a complex one, of an LSI filter. Finally, find and
plot the frequency response of this filter. Hint: You will need

$$\sum_{k=-M}^{M} \exp(jk\theta) = \frac{\sin((2M+1)\theta/2)}{\sin(\theta/2)}.$$

19.23 (^) (w) Consider the continuous-time averager

$$Y(t) = \frac{1}{T} \int_{t-T}^{t} X(\xi)\, d\xi$$

where the random process $X(t)$ is continuous-time white noise with PSD
$P_X(F) = N_0/2$ for $-\infty < F < \infty$. Determine the CCF $r_{X,Y}(\tau)$ and show
that it is zero for $\tau$ outside the interval $[0, T]$. Explain why it is zero outside
this interval.

19.24 (f) If a continuous-time white noise process $X(t)$ with ACF $r_X(\tau) = (N_0/2)\delta(\tau)$
is input to an LTI system with impulse response $h(\tau) = \exp(-\tau)u(\tau)$, determine
$r_{X,Y}(\tau)$.

19.25  (t)  Can  the  CPSD  ever  have  the  same  properties  as  the  PSD  in  terms  of  being 
real  and  symmetric?  If  so,  give  an  example.  Hint:  Consider  the  relationship 
given  in  (19.43). 

19.26 (0) (f,c) Consider the random processes $X[n] = U[n]$ and $Y[n] = U[n] -
bU[n-1]$, where $U[n]$ is white Gaussian noise with variance $\sigma_U^2 = 1$. Find
$r_{X,Y}[k]$ and then to verify your results perform a computer simulation. To do
so first generate $N = 1000$ samples of $X[n]$ and $Y[n]$. Then, estimate the CCS
for $b = -0.1$ and $b = -1$. Explain your results.

19.27 (f,c) An AR random process is given by $X[n] = aX[n-1] + U[n]$, where $U[n]$
is white Gaussian noise with variance $\sigma_U^2$. Find the CCS $r_{X,U}[k]$ and then to
verify your results perform a computer simulation using $a = 0.5$ and $\sigma_U^2 = 1$.
To do so first generate $N = 1000$ samples of $U[n]$ and $X[n]$. Then, estimate the
CCS. Hint: Remember to set the initial condition $X[-1] \sim \mathcal{N}(0, \sigma_U^2/(1 - a^2))$.

19.28 (w) In this problem we explore the use of the CCF to determine the direction
of arrival of a sound source. Referring to Figure 19.10, a sound source emits a
pulse that propagates to a set of two receivers. Because the distance from the
source to the receivers is large, it is assumed that the wavefronts are planar
as shown. If the source has the angle $\theta$ with respect to the $x$ axis as shown,
the pulse first reaches receiver 2 and then reaches receiver 1 at a time $t_0 = d\cos(\theta)/c$
seconds later, where $d$ is the distance between receivers and $c$ is the propagation
speed. Assume that the received signal at receiver 2 is a WSS random process
$X_2(t) = U(t)$ with a PSD


|F|  <  W 
|F|  >W 


and therefore the received signal at receiver 1 is $X_1(t) = U(t - t_0)$. Determine
the CCF $r_{X_1,X_2}(\tau)$ and describe how it could be used to find the arrival angle
$\theta$.


Figure  19.10:  Geometry  for  sound  source  arrival  angle  measurement  (figure  for 
Problem  19.28). 


Chapter  20 


Gaussian  Random  Processes 


20.1  Introduction 

There are several types of random processes that have found wide application because
of their realistic physical modeling yet relative mathematical simplicity. In
this and the next two chapters we describe these important random processes. They
are the Gaussian random process, the subject of this chapter; the Poisson random
process, described in Chapter 21; and the Markov chain, described in Chapter 22.
Concentrating now on the Gaussian random process, we will see that it has many
important properties. These properties have been inherited from those of the
$N$-dimensional Gaussian PDF, which was discussed in Section 14.3. Specifically, the
important characteristics of a Gaussian random process are:

1.  It  is  physically  motivated  by  the  central  limit  theorem  (see  Chapter  15). 

2.  It  is  a  mathematically  tractable  model. 

3. The joint PDF of any set of samples is a multivariate Gaussian PDF, which
enjoys many useful properties (see Chapter 14).

4. Only the first two moments, the mean sequence and the covariance sequence, are
required to completely describe it. As a result,

a. In practice the joint PDF can be estimated by estimating only the first two
moments.

b. If the Gaussian random process is wide sense stationary, then it is also
stationary.

5. The processing of a Gaussian random process by a linear filter does not alter
its Gaussian nature, but only modifies the first two moments. The modified
moments are easily found.

In effect, the Gaussian random process has so many useful properties that it is always
the first model to be proposed in the solution of a problem. It finds application
as a model for electronic noise [Bell Labs 1970], ambient ocean noise [Urick 1975],
scattering phenomena such as reverberation of sound in the ocean or electromagnetic
clutter in the atmosphere [Van Trees 1971], and financial time series [Taylor 1986],
just to name a few. Any time a random process can be modeled as due to the sum of
a large number of independent and similar type effects, a Gaussian random process
results due to the central limit theorem. One example that we will explore in detail
is the use of the scattering of a sound pulse from a school of fish to determine their
numbers (see Section 20.9). In this case, the received waveform is the sum of a large
number of scattered pulses that have been added together. The addition occurs
because the leading edge of a pulse that is reflected from a fish farther away will
coincide in time with the trailing edge of the pulse that is reflected from a fish that
is nearer (see Figure 20.14). If the fish are about the same size and type, then the
average intensity of the returned echos will be relatively constant. However, the
echo amplitudes will be different due to the different reflection characteristics of
each fish, i.e., its exact position, orientation, and motion will all determine how the
incoming pulse is scattered. These characteristics cannot be predicted in advance
and so the amplitudes are modeled as random variables. When overlapped in time,
these random echos are well modeled by a Gaussian random process. As an example,
consider a transmitted pulse $s(t) = \cos(2\pi F_0 t)$, where $F_0 = 10$ Hz, over the time
interval $0 \le t \le 1$ second as shown in Figure 20.1. Assuming a single reflection
for every 0.1 second interval with the starting time being a uniformly distributed
random variable within the interval and an amplitude $A$ that is a random variable
with $A \sim \mathcal{N}(0, 1)$ to account for the unknown reflection coefficient of each fish, a
typical received waveform is shown in Figure 20.2.

Figure 20.1: Transmitted sinusoidal pulse. (a) Transmit pulse. (b) Transmit pulse
shown in receive waveform observation window.

Figure 20.2: Received waveform consisting of many randomly overlapped and random
amplitude echos (horizontal axis: $t$ in seconds).

If we now estimate the marginal
PDF for $x(t)$ as shown in Figure 20.2 by assuming that each sample has the same
marginal PDF, we have the estimated PDF shown in Figure 20.3 (see Section 10.9
on how to estimate the PDF). Also shown is the Gaussian PDF with its mean
and variance estimated from uniformly spaced samples of $x(t)$. It is seen that the
Gaussian PDF is very accurate as we would expect from the central limit theorem.
The MATLAB code used to generate Figure 20.2 is given in Appendix 20A. In
Section 20.3 we formally define the Gaussian random process.
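The echo experiment described above can be sketched in a few lines. The following is not the Appendix 20A MATLAB code but a minimal Python stand-in using only the standard library; the function names, the 100 Hz sampling rate, and the 400 second observation window are assumptions made for illustration. Under the stated model about ten echoes overlap at any time, each contributing an average power of 1/2, so a sample variance near 5 would be expected, and roughly 68% of the samples should fall within one standard deviation of the mean if the marginal PDF is close to Gaussian.

```python
import math
import random


def received_waveform(duration=400.0, fs=100.0, f0=10.0, pulse_len=1.0,
                      slot=0.1, seed=0):
    """Sum of randomly delayed cosine pulses with N(0, 1) amplitudes."""
    rng = random.Random(seed)
    n_samples = int(round(duration * fs))
    x = [0.0] * n_samples
    for i in range(int(round(duration / slot))):
        start = (i + rng.random()) * slot   # echo start, uniform within slot i
        amp = rng.gauss(0.0, 1.0)           # random reflection amplitude A
        first = math.ceil(start * fs)
        last = min(n_samples - 1, math.floor((start + pulse_len) * fs))
        for n in range(first, last + 1):
            x[n] += amp * math.cos(2.0 * math.pi * f0 * (n / fs - start))
    return x


def summarize(x, fs=100.0, skip=1.0):
    """Sample mean, sample variance, and fraction of samples within one
    standard deviation of the mean (about 0.68 for a Gaussian PDF)."""
    y = x[int(round(skip * fs)):]           # drop the start-up transient
    mean = sum(y) / len(y)
    var = sum((v - mean) ** 2 for v in y) / len(y)
    std = math.sqrt(var)
    frac = sum(1 for v in y if abs(v - mean) <= std) / len(y)
    return mean, var, frac


if __name__ == "__main__":
    print(summarize(received_waveform()))
```

The first second of the waveform is discarded in `summarize` because fewer than ten echoes overlap there.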

Figure 20.3: Marginal PDF of samples of received waveform shown in Figure 20.2
and Gaussian PDF fit.

20.2 Summary

Section 20.1 gives an example of why the Gaussian random process arises quite
frequently in practice. The discrete-time Gaussian random process is defined in
Section 20.3 as one whose samples comprise a Gaussian random vector as characterized
by the PDF of (20.1). Also, some examples are given and are shown to
exhibit two important properties, which are summarized in that section. Any linear
transformation of a Gaussian random process produces another Gaussian random
process. In particular for a discrete-time WSS Gaussian random process that is
filtered by an LSI filter, the output random process is Gaussian with PDF given
in Theorem 20.4.1. A nonlinear transformation does not maintain the Gaussian
random process but its effect can be found in terms of the output moments using
(20.12). An example of a squaring operation on a discrete-time WSS Gaussian random
process produces an output random process that is still WSS with moments
given by (20.14). A continuous-time Gaussian random process is defined in Section
20.6 and examples are given. An important one is the Wiener random process
examined in Example 20.7. Its covariance matrix is found using (20.16). Some
special continuous-time Gaussian random processes are described in Section 20.7.
The Rayleigh fading sinusoid is described in Section 20.7.1. It has the ACF given
by (20.17) and corresponding PSD given by (20.18). A continuous-time bandpass
Gaussian random process is described in Section 20.7.2. It has an ACF given by
(20.21) and a corresponding PSD given by (20.22). The important example of bandpass
"white" Gaussian noise is discussed in Example 20.8. The computer generation
of a discrete-time WSS Gaussian random process realization is described in Section
20.8. Finally, an application of the theory to estimating fish populations using a
sonar is the subject of Section 20.9.


20.3  Definition  of  the  Gaussian  Random  Process 


We will consider here the discrete-time Gaussian random process, an example of
which was given in Figure 16.5b as the discrete-time/continuous-valued (DTCV)
random process. The continuous-time/continuous-valued (CTCV) Gaussian random
process, an example of which was given in Figure 16.5d, will be discussed in
Section 20.6. Before defining the Gaussian random process we briefly review the
$N$-dimensional multivariate Gaussian PDF as described in Section 14.3. An $N \times 1$
random vector $\mathbf{X} = [X_1\; X_2 \ldots X_N]^T$ is defined to be a Gaussian random vector if
its joint PDF is given by the multivariate Gaussian PDF

$$p_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{N/2} \det^{1/2}(\mathbf{C})} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{C}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right] \tag{20.1}$$

where $\boldsymbol{\mu} = [\mu_1\; \mu_2 \ldots \mu_N]^T$ is the mean vector defined as

$$\boldsymbol{\mu} = E_{\mathbf{X}}[\mathbf{X}] = \begin{bmatrix} E_{X_1}[X_1] \\ E_{X_2}[X_2] \\ \vdots \\ E_{X_N}[X_N] \end{bmatrix} \tag{20.2}$$

and $\mathbf{C}$ is the $N \times N$ covariance matrix defined as

$$\mathbf{C} = \begin{bmatrix} \mathrm{var}(X_1) & \mathrm{cov}(X_1, X_2) & \ldots & \mathrm{cov}(X_1, X_N) \\ \mathrm{cov}(X_2, X_1) & \mathrm{var}(X_2) & \ldots & \mathrm{cov}(X_2, X_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(X_N, X_1) & \mathrm{cov}(X_N, X_2) & \ldots & \mathrm{var}(X_N) \end{bmatrix}. \tag{20.3}$$

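In shorthand notation $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$. The important properties of a Gaussian
random vector are:

1. Only the first two moments $\boldsymbol{\mu}$ and $\mathbf{C}$ are required to specify the entire PDF.

2. If all the random variables are uncorrelated so that $[\mathbf{C}]_{ij} = 0$ for $i \neq j$, then they
are also independent.

3. A linear transformation of $\mathbf{X}$ produces another Gaussian random vector. Specifically,
if $\mathbf{Y} = \mathbf{G}\mathbf{X}$, where $\mathbf{G}$ is an $M \times N$ matrix with $M \le N$, then
$\mathbf{Y} \sim \mathcal{N}(\mathbf{G}\boldsymbol{\mu}, \mathbf{G}\mathbf{C}\mathbf{G}^T)$.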

Now we consider a discrete-time random process $X[n]$, where $n \ge 0$ for a semi-infinite
random process and $-\infty < n < \infty$ for an infinite random process. The
random process is defined to be a Gaussian random process if all finite sets of samples
have a multivariate Gaussian PDF as per (20.1). Mathematically, if $\mathbf{X} =
[X[n_1]\; X[n_2] \ldots X[n_K]]^T$ has a multivariate Gaussian PDF (given in (20.1) with $N$
replaced by $K$) for all $\{n_1, n_2, \ldots, n_K\}$ and all $K$, then $X[n]$ is said to be a Gaussian
random process. Some examples follow.

Example  20.1  -  White  Gaussian  noise 

White Gaussian noise was first introduced in Example 16.6. We revisit that example
in light of our formal definition of a Gaussian random process. First recall that
discrete-time white noise is a WSS random process $X[n]$ for which $E[X[n]] = \mu = 0$
for $-\infty < n < \infty$ and $r_X[k] = \sigma^2 \delta[k]$. This says that all the samples are zero mean,
uncorrelated with each other, and have the same variance $\sigma^2$. If we now furthermore
assume that the samples are also independent and each sample has a Gaussian
PDF, then $X[n]$ is a Gaussian random process. It is referred to as white Gaussian
noise (WGN). To verify this we need to show that any set of samples has a multivariate
Gaussian PDF. Let $\mathbf{X} = [X[n_1]\; X[n_2] \ldots X[n_K]]^T$ and note that the joint
$K$-dimensional PDF is the product of the marginal PDFs due to the independence
assumption. Also, each marginal PDF is $X[n_i] \sim \mathcal{N}(0, \sigma^2)$ by assumption. This
produces the joint PDF

$$\begin{aligned} p_{\mathbf{X}}(\mathbf{x}) &= \prod_{i=1}^{K} p_{X[n_i]}(x[n_i]) \\ &= \frac{1}{(2\pi\sigma^2)^{K/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{K} x^2[n_i]\right) \\ &= \frac{1}{(2\pi)^{K/2} \det^{1/2}(\sigma^2 \mathbf{I})} \exp\left(-\frac{1}{2} \mathbf{x}^T (\sigma^2 \mathbf{I})^{-1} \mathbf{x}\right) \end{aligned}$$

or $\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$, where $\mathbf{I}$ is the $K \times K$ identity matrix. Note also that since WGN
is an IID random process, it is also stationary (see Example 16.3).

◊
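A short numerical check of the WGN properties used in Example 20.1 can be written with the Python standard library. This is only a sketch under stated assumptions: the helper names `wgn` and `sample_acs`, the seed, and the choice of a time-average estimator for the ACS are illustrative and are not taken from the text.

```python
import random


def wgn(n_samples, sigma2=1.0, seed=0):
    """Generate n_samples of white Gaussian noise with variance sigma2."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    return [rng.gauss(0.0, sigma) for _ in range(n_samples)]


def sample_acs(x, k):
    """Estimate r_X[k] = E[X[n] X[n+k]] by a time average (k >= 0)."""
    n = len(x) - k
    return sum(x[i] * x[i + k] for i in range(n)) / n


if __name__ == "__main__":
    x = wgn(100_000, sigma2=2.0)
    # r_X[0] should be close to sigma^2, r_X[k] close to 0 for k != 0
    print([round(sample_acs(x, k), 3) for k in range(4)])
```

The estimate at lag 0 should be close to $\sigma^2$ and the estimates at nonzero lags close to zero, in agreement with $r_X[k] = \sigma^2 \delta[k]$.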


Example  20.2  —  Moving  average  random  process 

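Consider the MA random process $X[n] = (U[n] + U[n-1])/2$, where $U[n]$ is WGN
with variance $\sigma_U^2$. Then, $X[n]$ is a Gaussian random process. This is because $U[n]$
is a Gaussian random process (from the previous example) and $X[n]$ is just a linear
transformation of $U[n]$. For instance, if $K = 2$, and $n_1 = 0$, $n_2 = 1$, then

$$\underbrace{\begin{bmatrix} X[0] \\ X[1] \end{bmatrix}}_{\mathbf{X}} = \underbrace{\begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}}_{\mathbf{G}} \underbrace{\begin{bmatrix} U[-1] \\ U[0] \\ U[1] \end{bmatrix}}_{\mathbf{U}}$$

and thus $\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}\mathbf{C}_U\mathbf{G}^T) = \mathcal{N}(\mathbf{0}, \sigma_U^2 \mathbf{G}\mathbf{G}^T)$. The same argument applies to any
number of samples $K$ and any sample times $n_1, n_2, \ldots, n_K$. Note here that the MA
random process is also stationary. If we were to change the two samples to $n_1 = n_0$
and $n_2 = n_0 + 1$, then

$$\begin{bmatrix} X[n_0] \\ X[n_0 + 1] \end{bmatrix} = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix} \begin{bmatrix} U[n_0 - 1] \\ U[n_0] \\ U[n_0 + 1] \end{bmatrix}$$

and the joint PDF will be the same since the $\mathbf{U}$ vector has the same PDF. Again
this result remains the same for any number of samples and sample times. We will
see shortly that a Gaussian random process that is WSS is also stationary. Here,
the $U[n]$ random process is WSS and hence $X[n]$ is WSS, being the output of an
LSI filter (see Theorem 18.3.1).

As a typical probability calculation let $\sigma_U^2 = 1$ and determine $P[X[1] - X[0] > 1]$.
We would expect this to be less than $P[U[1] - U[0] > 1] = Q(1/\sqrt{2})$ (since
$U[1] - U[0] \sim \mathcal{N}(0, 2)$) due to the smoothing effect of the filter
($\mathcal{H}(z) = \frac{1}{2} + \frac{1}{2}z^{-1}$). Thus, let $Y = X[1] - X[0]$ or

$$Y = \underbrace{\begin{bmatrix} -1 & 1 \end{bmatrix}}_{\mathbf{A}} \underbrace{\begin{bmatrix} X[0] \\ X[1] \end{bmatrix}}_{\mathbf{X}}.$$

Then, since $Y$ is a linear transformation of $\mathbf{X}$, we have $Y \sim \mathcal{N}(0, \mathrm{var}(Y))$, where
$\mathrm{var}(Y) = \mathbf{A}\mathbf{C}\mathbf{A}^T$. Thus,

$$\begin{aligned} \mathrm{var}(Y) &= \begin{bmatrix} -1 & 1 \end{bmatrix} \mathbf{C} \begin{bmatrix} -1 \\ 1 \end{bmatrix} \\ &= \begin{bmatrix} -1 & 1 \end{bmatrix} \mathbf{G}\mathbf{G}^T \begin{bmatrix} -1 \\ 1 \end{bmatrix} \qquad (\mathbf{C} = \sigma_U^2 \mathbf{G}\mathbf{G}^T = \mathbf{G}\mathbf{G}^T) \\ &= \begin{bmatrix} -1 & 1 \end{bmatrix} \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix} \begin{bmatrix} \frac{1}{2} & 0 \\ \frac{1}{2} & \frac{1}{2} \\ 0 & \frac{1}{2} \end{bmatrix} \begin{bmatrix} -1 \\ 1 \end{bmatrix} \\ &= \begin{bmatrix} -1 & 1 \end{bmatrix} \begin{bmatrix} \frac{1}{2} & \frac{1}{4} \\ \frac{1}{4} & \frac{1}{2} \end{bmatrix} \begin{bmatrix} -1 \\ 1 \end{bmatrix} = \frac{1}{2} \end{aligned}$$

so that $Y \sim \mathcal{N}(0, 1/2)$. Therefore,

$$P[X[1] - X[0] > 1] = Q\left(\frac{1}{\sqrt{1/2}}\right) = 0.0786 < Q\left(\frac{1}{\sqrt{2}}\right) = 0.2398$$

and is consistent with our notion of smoothing.

◊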
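The probability computed in Example 20.2 can be cross-checked by simulation. The sketch below, which uses only the Python standard library, evaluates the $Q$ function through the complementary error function and estimates $P[X[1] - X[0] > 1]$ by Monte Carlo for $\sigma_U^2 = 1$; the function names, trial count, and seed are assumptions made for illustration.

```python
import math
import random


def Q(x):
    """Right-tail probability of a standard Gaussian random variable."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def estimate_prob(n_trials=200_000, seed=0):
    """Monte Carlo estimate of P[X[1] - X[0] > 1] for the MA process
    X[n] = (U[n] + U[n-1]) / 2 driven by unit-variance WGN."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_trials):
        u_m1, u0, u1 = (rng.gauss(0.0, 1.0) for _ in range(3))
        x0 = 0.5 * (u0 + u_m1)
        x1 = 0.5 * (u1 + u0)
        if x1 - x0 > 1.0:
            count += 1
    return count / n_trials


if __name__ == "__main__":
    print("theory    :", Q(1.0 / math.sqrt(0.5)))   # Y ~ N(0, 1/2)
    print("simulation:", estimate_prob())
    print("unfiltered:", Q(1.0 / math.sqrt(2.0)))   # U[1] - U[0] ~ N(0, 2)
```

The simulated value should agree with $Q(\sqrt{2})$ to within the Monte Carlo error and lie well below the unfiltered value $Q(1/\sqrt{2})$.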


Example 20.3 - Discrete-time Wiener random process or Brownian motion

This random process is basically a random walk with Gaussian "steps" or more
specifically the sum process (see also Example 16.4)

$$X[n] = \sum_{i=0}^{n} U[i] \qquad n \ge 0$$

where $U[n]$ is WGN with variance $\sigma_U^2$. Note that the increments $X[n_2] - X[n_1]$ are
independent and stationary (why?). As in the previous example, any set of samples
of $X[n]$ is a linear transformation of the $U[i]$'s and hence has a multivariate Gaussian
PDF. For example,

$$\begin{bmatrix} X[0] \\ X[1] \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}}_{\mathbf{G}} \begin{bmatrix} U[0] \\ U[1] \end{bmatrix}$$

and therefore the Wiener random process is a Gaussian random process. It is clearly
nonstationary, since, for example, the variance increases with $n$ (recall from Example
16.4 that $\mathrm{var}(X[n]) = (n + 1)\sigma_U^2$).

◊
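The growth of the variance with $n$ in Example 20.3 is easy to observe numerically. The following minimal sketch (standard library Python; the function names, ensemble size, and seed are illustrative assumptions) generates realizations of the sum process and estimates $\mathrm{var}(X[n])$ across the ensemble, which should be close to $(n+1)\sigma_U^2$.

```python
import random


def wiener_realization(n_samples, sigma2_u=1.0, rng=random):
    """Return X[0], ..., X[n_samples-1] with X[n] = U[0] + ... + U[n]."""
    sigma = sigma2_u ** 0.5
    x, total = [], 0.0
    for _ in range(n_samples):
        total += rng.gauss(0.0, sigma)   # add one Gaussian step U[n]
        x.append(total)
    return x


def ensemble_variance(n, n_realizations=20_000, sigma2_u=1.0, seed=0):
    """Estimate var(X[n]) from independent realizations of the process."""
    rng = random.Random(seed)
    vals = [wiener_realization(n + 1, sigma2_u, rng)[n]
            for _ in range(n_realizations)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


if __name__ == "__main__":
    for n in (0, 4, 9):
        print(n, ensemble_variance(n), "expected", n + 1)
```

An ensemble average is used rather than a time average because the process is nonstationary, so a time average over a single realization would not estimate $\mathrm{var}(X[n])$.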

In Example 20.1 we saw that if the samples are uncorrelated, and the random
process is Gaussian and hence the multivariate Gaussian PDF applies, then the
samples are also independent. In Examples 20.1 and 20.2, the random processes
were WSS but due to the fact that they are also Gaussian random processes, they
are also stationary. We summarize and then prove these two properties next.

Property  20.1  —  A  Gaussian  random  process  with  uncorrelated  samples 
has  independent  samples. 

Proof: 

Since  the  random  process  is  Gaussian,  the  PDF  of  (20.1)  applies  for  any  set  of 
samples.  But  for  uncorrelated  samples,  the  covariance  matrix  is  diagonal  and  hence 
the  joint  PDF  factors  into  the  product  of  its  marginal  PDFs.  Hence,  the  samples 
are  independent. 

□ 


Property  20.2  -  A  WSS  Gaussian  random  process  is  also  stationary. 

Proof: 

Since the random process is Gaussian, the PDF of (20.1) applies for any set of
samples. But if $X[n]$ is also WSS, then for any $n_0$

$$E[X[n_i + n_0]] = \mu_X[n_i + n_0] = \mu \qquad i = 1, 2, \ldots, K$$

and

$$\begin{aligned} [\mathbf{C}]_{ij} &= \mathrm{cov}(X[n_i + n_0], X[n_j + n_0]) \\ &= E[X[n_i + n_0] X[n_j + n_0]] - E[X[n_i + n_0]] E[X[n_j + n_0]] \\ &= r_X[n_j - n_i] - \mu^2 \qquad \text{(due to WSS)} \end{aligned}$$

for $i = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, K$. Since the mean vector and the covariance
matrix do not depend on $n_0$, the joint PDF also does not depend on $n_0$. Hence, the
WSS Gaussian random process is also stationary.

□

20.4 Linear Transformations


Any linear transformation of a Gaussian random process produces another Gaussian
random process. In Example 20.2 the white noise random process $U[n]$ was
Gaussian, and the MA random process $X[n]$, which was the result of a linear
transformation, is another Gaussian random process. The MA random process
in that example can be viewed as the output of the LSI filter with system function
$\mathcal{H}(z) = 1/2 + (1/2)z^{-1}$ whose input is $U[n]$. This result, that if the input to an
LSI filter is a Gaussian random process, then the output is also a Gaussian random
process, is true in general. The random processes described by the linear difference
equations

$X[n] = aX[n-1] + U[n]$  AR random process (see Example 17.5)

$X[n] = U[n] - bU[n-1]$  MA random process (see Example 18.6)

$X[n] = aX[n-1] + U[n] - bU[n-1]$  ARMA random process (This is the definition.)

can also be viewed as the outputs of LSI filters with respective system functions

$$\mathcal{H}(z) = \frac{1}{1 - az^{-1}}$$

$$\mathcal{H}(z) = 1 - bz^{-1}$$

$$\mathcal{H}(z) = \frac{1 - bz^{-1}}{1 - az^{-1}}.$$

As a result, since the input $U[n]$ is a Gaussian random process, they are all Gaussian
random processes. Furthermore, since it is only necessary to know the first two
moments to specify the joint PDF of a set of samples of a Gaussian random process,
the PDF for the output random process of an LSI filter is easily found. In particular,
assume we are interested in the filtering of a WSS Gaussian random process by an
LSI filter with frequency response $H(f)$. Then, if the input to the filter is the WSS
Gaussian random process $X[n]$, which has a mean of $\mu_X$ and an ACS of $r_X[k]$, then
we know from Theorem 18.3.1 that the output random process $Y[n]$ is also WSS and
its mean and PSD are

$$\mu_Y = \mu_X H(0) \tag{20.4}$$

$$P_Y(f) = |H(f)|^2 P_X(f) \tag{20.5}$$

and furthermore $Y[n]$ is a Gaussian random process (and is stationary according to
Property 20.2). (See also Problem 20.7.) The joint PDF for any set of samples of
$Y[n]$ is found from (20.1) by using (20.4) and (20.5). An example follows.

Example  20.4  —  A  differencer 

A WSS Gaussian random process $X[n]$ with mean $\mu_X$ and ACS $r_X[k]$ is input to
a differencer. The output random process is defined to be $Y[n] = X[n] - X[n-1]$.
What is the PDF of two successive output samples? To solve this we first note that
the output random process is Gaussian and also WSS since a differencer is just an
LSI filter whose system function is $\mathcal{H}(z) = 1 - z^{-1}$. We need only find the first two
moments of $Y[n]$. The mean is

$$E[Y[n]] = E[X[n]] - E[X[n-1]] = \mu_X - \mu_X = 0$$

and the ACS can be found as the inverse Fourier transform of $P_Y(f)$. But from
(20.5) with $H(f) = \mathcal{H}(\exp(j2\pi f)) = 1 - \exp(-j2\pi f)$, we have

$$\begin{aligned} P_Y(f) &= H(f) H^*(f) P_X(f) \\ &= [1 - \exp(-j2\pi f)][1 - \exp(j2\pi f)] P_X(f) \\ &= 2P_X(f) - \exp(j2\pi f) P_X(f) - \exp(-j2\pi f) P_X(f). \end{aligned}$$

Taking the inverse Fourier transform produces

$$r_Y[k] = 2r_X[k] - r_X[k+1] - r_X[k-1]. \tag{20.6}$$

For two successive samples, say $Y[0]$ and $Y[1]$, we require the covariance matrix of
$\mathbf{Y} = [Y[0]\; Y[1]]^T$. Since $Y[n]$ has a zero mean, this is just

$$\mathbf{C}_Y = \begin{bmatrix} r_Y[0] & r_Y[1] \\ r_Y[1] & r_Y[0] \end{bmatrix}$$

and thus using (20.6), it becomes

$$\mathbf{C}_Y = \begin{bmatrix} 2(r_X[0] - r_X[1]) & 2r_X[1] - r_X[2] - r_X[0] \\ 2r_X[1] - r_X[2] - r_X[0] & 2(r_X[0] - r_X[1]) \end{bmatrix}.$$

The joint PDF is then

$$p_{Y[0],Y[1]}(y[0], y[1]) = \frac{1}{2\pi \det^{1/2}(\mathbf{C}_Y)} \exp\left(-\frac{1}{2} \mathbf{y}^T \mathbf{C}_Y^{-1} \mathbf{y}\right)$$

where $\mathbf{y} = [y[0]\; y[1]]^T$. See also Problem 20.5.

◊
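The covariance matrix of Example 20.4 follows mechanically from (20.6), which the short sketch below evaluates for two input ACSs. The function names are hypothetical, and the two test inputs are assumptions chosen for illustration: unit-variance white noise, and a zero mean AR process with the standard ACS $r_X[k] = \sigma_U^2 a^{|k|}/(1 - a^2)$.

```python
def differencer_acs(r_x, k):
    """r_Y[k] = 2 r_X[k] - r_X[k+1] - r_X[k-1], as in (20.6)."""
    return 2.0 * r_x(k) - r_x(k + 1) - r_x(k - 1)


def differencer_cov(r_x):
    """2 x 2 covariance matrix of [Y[0], Y[1]] at the differencer output."""
    r0 = differencer_acs(r_x, 0)
    r1 = differencer_acs(r_x, 1)
    return [[r0, r1], [r1, r0]]


def white(k, sigma2=1.0):
    """ACS of white noise: sigma2 * delta[k]."""
    return sigma2 if k == 0 else 0.0


def ar1(k, a=0.5, sigma2_u=1.0):
    """ACS of a zero mean AR process X[n] = a X[n-1] + U[n]."""
    return sigma2_u / (1.0 - a * a) * a ** abs(k)


if __name__ == "__main__":
    print(differencer_cov(white))   # [[2.0, -1.0], [-1.0, 2.0]]
    print(differencer_cov(ar1))
```

For white noise the off-diagonal entry is negative, reflecting that successive differences share the sample $X[0]$ with opposite signs.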

We  now  summarize  the  foregoing  results  in  a  theorem. 

Theorem  20.4.1  (Linear  filtering  of  a  WSS  Gaussian  random  process) 

Suppose that $X[n]$ is a WSS Gaussian random process with mean $\mu_X$ and ACS $r_X[k]$
that is input to an LSI filter with frequency response $H(f)$. Then, the PDF of $N$
successive output samples $\mathbf{Y} = [Y[0]\; Y[1] \ldots Y[N-1]]^T$ is given by

$$p_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(2\pi)^{N/2} \det^{1/2}(\mathbf{C}_Y)} \exp\left[-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu}_Y)^T \mathbf{C}_Y^{-1} (\mathbf{y} - \boldsymbol{\mu}_Y)\right] \tag{20.7}$$

where

$$\boldsymbol{\mu}_Y = \begin{bmatrix} \mu_X H(0) \\ \vdots \\ \mu_X H(0) \end{bmatrix} \tag{20.8}$$

$$[\mathbf{C}_Y]_{mn} = r_Y[m - n] - (\mu_X H(0))^2 \tag{20.9}$$

$$= \int_{-\frac{1}{2}}^{\frac{1}{2}} |H(f)|^2 P_X(f) \exp(j2\pi f(m - n))\, df - (\mu_X H(0))^2 \tag{20.10}$$

for $m = 1, 2, \ldots, N$; $n = 1, 2, \ldots, N$. The same PDF is obtained for any shifted set
of successive samples since $Y[n]$ is stationary.

Note  that  in  the  preceding  theorem  the  covariance  matrix  is  a  symmetric  Toeplitz 
matrix  (all  elements  along  each  northwest-southeast  diagonal  are  the  same)  due  to 
the  assumption  of  successive  samples  (see  also  Section  17.4). 
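To make the structure of the covariance matrix in Theorem 20.4.1 concrete, the sketch below evaluates the integral in (20.10) numerically and assembles the Toeplitz matrix. It is an illustration under simplifying assumptions that are not part of the theorem: the input is taken to be zero mean (so that the PSD has no impulse at $f = 0$ and the $(\mu_X H(0))^2$ term vanishes), the filter is a causal FIR filter given by its impulse response, and the integral is approximated by a midpoint rule. All function names are hypothetical.

```python
import cmath
import math


def freq_response(h, f):
    """H(f) = sum_k h[k] exp(-j 2 pi f k) for a causal FIR filter h."""
    return sum(hk * cmath.exp(-2j * math.pi * f * k) for k, hk in enumerate(h))


def output_acs(h, psd_x, k, n_grid=512):
    """Midpoint-rule approximation of the integral over |f| <= 1/2 of
    |H(f)|^2 P_X(f) exp(j 2 pi f k) df, i.e. r_Y[k] for a zero mean input."""
    total = 0.0
    for i in range(n_grid):
        f = -0.5 + (i + 0.5) / n_grid
        gain = abs(freq_response(h, f)) ** 2 * psd_x(f)
        total += (gain * cmath.exp(2j * math.pi * f * k)).real
    return total / n_grid


def output_cov(h, psd_x, n):
    """N x N symmetric Toeplitz covariance matrix [C_Y]_mn = r_Y[m - n]."""
    r = [output_acs(h, psd_x, k) for k in range(n)]
    return [[r[abs(m - q)] for q in range(n)] for m in range(n)]


if __name__ == "__main__":
    # MA filter of Example 20.2 driven by unit-variance white noise
    for row in output_cov([0.5, 0.5], lambda f: 1.0, 3):
        print([round(v, 6) for v in row])
```

For the MA filter of Example 20.2 with a flat unit PSD, the first row should come out close to $[1/2,\ 1/4,\ 0]$, matching the matrix $\mathbf{G}\mathbf{G}^T$ found there.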

Another transformation that occurs quite frequently is the sum of two independent
Gaussian random processes. If $X[n]$ is a Gaussian random process and
$Y[n]$ is another Gaussian random process, and $X[n]$ and $Y[n]$ are independent, then
$Z[n] = X[n] + Y[n]$ is a Gaussian random process (see Problem 20.9). By independence
of two random processes we mean that all sets of samples of $X[n]$ or
$\{X[n_1], X[n_2], \ldots, X[n_K]\}$ and of $Y[n]$ or $\{Y[m_1], Y[m_2], \ldots, Y[m_L]\}$ are independent
of each other. This must hold for all $n_1, \ldots, n_K$, $m_1, \ldots, m_L$ and for all $K$ and
$L$. If this is the case then the PDF of the entire set of samples can be written as
the product of the PDFs of each set of samples.


20.5  Nonlinear  Transformations 

The  Gaussian  random  process  is  one  of  the  few  random  processes  for  which  the 
moments  at  the  output  of  a  nonlinear  transformation  can  easily  be  found.  In  par¬ 
ticular,  a  polynomial  transformation  lends  itself  to  output  moment  evaluation.  This 
is  because  the  higher-order  joint  moments  of  a  multivariate  Gaussian  PDF  can  be 
expressed  in  terms  of  first -  and  second- order  moments.  In  fact,  this  is  not  sur¬ 
prising  in  that  the  multivariate  Gaussian  PDF  is  characterized  by  its  first-  and 
second-order  moments.  As  a  result,  in  computing  the  joint  moments,  any  integral  of 
the  form  ^  xlf  . . .  x^pxx^.^Xn  (#15  •  •  •  ,XN)dx  1 . . .  dxjsf  must  be  a  function 

of  the  mean  vector  and  covariance  matrix.  Hence,  the  joint  moments  must  be  a 
function  of  the  first-  and  second-order  moments.  As  a  particular  case  of  interest, 
consider  the  fourth-order  moment  E[X  1 X2X3X4]  for  X  =  [X\  X2  X%  Xf]T  a  zero 
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mean  Gaussian  random  vector.  Then,  it  can  be  shown  that  (see  Problem  20.12) 

E[XxX2X^]  =  £[XiX2]£[X3X4]  +  £[XiX3]£[X2X4]  +  EiXxX^ElXiXs] 

(20.11) 

and this holds even if some of the random variables are the same (try $X_1 = X_2 =
X_3 = X_4$ and compare it to $E[X^4]$ for $X \sim \mathcal{N}(0, 1)$). It is seen that the fourth-order
moment is expressible as the sum of products of the second-order moments, which
are found from the covariance matrix. Now if X[n] is a Gaussian random process
with zero mean, then we have for any four samples (which by the definition of a
Gaussian random process have a fourth-order Gaussian PDF)

$$\begin{aligned}
E[X[n_1]X[n_2]X[n_3]X[n_4]] ={}& E[X[n_1]X[n_2]]\,E[X[n_3]X[n_4]] \\
&+ E[X[n_1]X[n_3]]\,E[X[n_2]X[n_4]] \\
&+ E[X[n_1]X[n_4]]\,E[X[n_2]X[n_3]] \qquad (20.12)
\end{aligned}$$

and if furthermore, X[n] is WSS, then this reduces to

$$\begin{aligned}
E[X[n_1]X[n_2]X[n_3]X[n_4]] ={}& r_X[n_2 - n_1]\,r_X[n_4 - n_3] + r_X[n_3 - n_1]\,r_X[n_4 - n_2] \\
&+ r_X[n_4 - n_1]\,r_X[n_3 - n_2]. \qquad (20.13)
\end{aligned}$$

This  formula  allows  us  to  easily  calculate  the  effect  of  a  polynomial  transformation 
on  the  moments  of  a  WSS  Gaussian  random  process.  An  example  follows. 
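
The moment factoring in (20.11) can be checked numerically. The following is a minimal sketch, not taken from the text: the function names, the seed, and the number of trials are our own assumptions. It evaluates the right-hand side of (20.11) from a covariance matrix and compares it with a Monte Carlo estimate for the special case $X_1 = X_2 = X_3 = X_4 = X$ with $X \sim \mathcal{N}(0, 1)$, for which the formula predicts $E[X^4] = 3$.

```python
import random


def fourth_moment_formula(C, i, j, k, l):
    """Right-hand side of (20.11) for a zero mean Gaussian vector with covariance C."""
    return C[i][j] * C[k][l] + C[i][k] * C[j][l] + C[i][l] * C[j][k]


def fourth_moment_monte_carlo(num_trials=200_000, seed=0):
    """Estimate E[X^4] for X ~ N(0, 1) by averaging over independent draws."""
    rng = random.Random(seed)  # fixed seed (assumed) so the run is repeatable
    total = 0.0
    for _ in range(num_trials):
        x = rng.gauss(0.0, 1.0)
        total += x ** 4
    return total / num_trials


# All four variables identical with unit variance: every covariance entry is 1.
C = [[1.0] * 4 for _ in range(4)]
predicted = fourth_moment_formula(C, 0, 1, 2, 3)
estimated = fourth_moment_monte_carlo()
print("formula:", predicted, "Monte Carlo:", round(estimated, 3))
```

The same `fourth_moment_formula` function can be applied to any covariance matrix, for instance one built from an ACS as in (20.13).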

Example  20.5  -  Effect  of  squaring  WSS  Gaussian  random  process 

Assuming that X[n] is a zero mean WSS Gaussian random process, we wish to
determine the effect of squaring it to form $Y[n] = X^2[n]$. Clearly, Y[n] will no
longer be a Gaussian random process since it can only take on nonnegative values
(see also Example 10.8). We can, however, show that Y[n] is still WSS. To do so
we calculate the mean as

$$E[Y[n]] = E[X^2[n]] = r_X[0]$$

which does not depend on n, and the autocorrelation sequence as

$$\begin{aligned}
E[Y[n]Y[n+k]] &= E[X^2[n]X^2[n+k]] \\
&= r_X^2[0] + 2r_X^2[k] \qquad (\text{using } n_1 = n_2 = n \text{ and } n_3 = n_4 = n + k \text{ in (20.13)})
\end{aligned}$$

which also does not depend on n. Thus, at the output of the squarer the random
process is WSS with

$$\begin{aligned}
\mu_Y &= r_X[0] \\
r_Y[k] &= r_X^2[0] + 2r_X^2[k]. \qquad (20.14)
\end{aligned}$$

Note that if the PSD at the input to the squarer is $P_X(f)$, then the output PSD is
obtained by taking the Fourier transform of (20.14) to yield

$$P_Y(f) = r_X^2[0]\,\delta(f) + 2P_X(f) \star P_X(f) \qquad (20.15)$$

where

$$P_X(f) \star P_X(f) = \int_{-1/2}^{1/2} P_X(\nu)P_X(f - \nu)\,d\nu$$

is a convolution integral. As a specific example, consider the MA random process
X[n] = (U[n] + U[n − 1])/2, where U[n] is WGN with variance $\sigma_U^2 = 1$. Then, typical
realizations of X[n] and Y[n] are shown in Figure 20.4.

Figure 20.4: Typical realization of a Gaussian MA random process and its squared
realization. (a) MA random process. (b) Squared MA random process.

The MA random process has a zero mean and ACS $r_X[k] = (1/2)\delta[k] + (1/4)\delta[k + 1] + (1/4)\delta[k - 1]$ (see
Example 17.3). Because of the squaring, the output mean is $E[Y[n]] = r_X[0] = 1/2$.
The PSD of X[n] can easily be shown to be $P_X(f) = (1 + \cos(2\pi f))/2$ and the PSD
of Y[n] follows most easily by taking the Fourier transform of $r_Y[k]$. From (20.14)
we have

$$\begin{aligned}
r_Y[k] &= r_X^2[0] + 2r_X^2[k] \\
&= \frac{1}{4} + 2\left(\frac{1}{2}\delta[k] + \frac{1}{4}\delta[k+1] + \frac{1}{4}\delta[k-1]\right)^2 \\
&= \frac{1}{4} + 2\left(\frac{1}{4}\delta[k] + \frac{1}{16}\delta[k+1] + \frac{1}{16}\delta[k-1]\right)
\end{aligned}$$

since all the cross-terms must be zero and $\delta^2[k - k_0] = \delta[k - k_0]$. Thus, we have

$$r_Y[k] = \frac{1}{4} + \frac{1}{2}\delta[k] + \frac{1}{8}\delta[k+1] + \frac{1}{8}\delta[k-1]$$

and taking the Fourier transform produces the PSD as

$$P_Y(f) = \frac{1}{4}\delta(f) + \frac{1}{2} + \frac{1}{4}\cos(2\pi f).$$

The PSDs are shown in Figure 20.5. Note that the squaring has produced an impulse
at f = 0 of strength 1/4 that is due to the nonzero mean of the Y[n] random process.
Also, the squaring has “widened” the PSD, the usual consequence of a convolution
in frequency.

Figure 20.5: PSDs of Gaussian MA random process and the squared random process.
(a) MA random process. (b) Squared MA random process.

❖
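
The predictions of Example 20.5 for the MA process X[n] = (U[n] + U[n − 1])/2 can be compared against a simulation. The sketch below is our own illustration (the function names, seed, and record length are assumptions): it generates a long realization, squares it, and compares time-average estimates of the mean and of $r_Y[k]$ with the values 1/2 and $r_X^2[0] + 2r_X^2[k]$ given by (20.14).

```python
import random


def simulate_squared_ma(num_samples=200_000, seed=1):
    """Return Y[n] = X^2[n] for the MA process X[n] = (U[n] + U[n-1]) / 2."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(num_samples + 1)]  # WGN, variance 1
    x = [(u[n + 1] + u[n]) / 2.0 for n in range(num_samples)]
    return [value * value for value in x]


def estimate_acs(y, lag):
    """Time-average estimate of E[Y[n] Y[n + lag]]."""
    count = len(y) - lag
    return sum(y[n] * y[n + lag] for n in range(count)) / count


def predicted_ry(lag):
    """r_Y[k] from (20.14) using r_X[0] = 1/2, r_X[+-1] = 1/4, zero otherwise."""
    rx = {0: 0.5, 1: 0.25}.get(abs(lag), 0.0)
    return 0.5 ** 2 + 2.0 * rx ** 2


y = simulate_squared_ma()
mean_y = sum(y) / len(y)
print("mean:", round(mean_y, 3), "(predicted 0.5)")
for k in range(3):
    print("lag", k, "estimate", round(estimate_acs(y, k), 3), "predicted", predicted_ry(k))
```

The estimates should settle near 0.75, 0.375, and 0.25 for lags 0, 1, and 2, with the residual 0.25 at large lags reflecting the nonzero mean of Y[n].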


20.6  Continuous-Time  Definitions  and  Formulas 

A continuous-time random process is defined to be a Gaussian random process if the
random vector $\mathbf{X} = [X(t_1)\ X(t_2) \ldots X(t_K)]^T$ has a multivariate Gaussian PDF for
all $\{t_1, t_2, \ldots, t_K\}$ and all K. The properties of a continuous-time Gaussian random
process  are  identical  to  those  for  the  discrete-time  random  process  as  summarized 
in  Properties  20.1  and  20.2.  Therefore,  we  will  proceed  directly  to  some  examples 
of  interest. 

Example  20.6  -  Continuous-time  WGN 

The continuous-time version of discrete-time WGN as defined in Example 20.1
is a continuous-time Gaussian random process X(t) that has a zero mean and an
ACF $r_X(\tau) = (N_0/2)\delta(\tau)$. The factor of $N_0/2$ is customarily used, since it is the
level of the corresponding PSD (see Example 17.11). The random process is called
continuous-time white Gaussian noise (WGN). This was previously described in
Example 17.11. Note that in addition to the samples being uncorrelated (since
$r_X(\tau) = 0$ for $\tau \neq 0$), they are also independent because of the Gaussian assumption.
Unfortunately, for continuous-time WGN, it is not possible to explicitly write
down the multivariate Gaussian PDF since $r_X(0) \to \infty$. Instead, as explained in
Example 17.11 we use continuous-time WGN only as a model, reserving any probability
calculations for the random process at the output of some filter, whose input
is WGN. This is illustrated next.

❖

Example  20.7  —  Continuous-time  Wiener  random  process  or  Brownian 
motion 

Let U(t) be WGN and define the semi-infinite random process

$$X(t) = \int_0^t U(\xi)\,d\xi \qquad t \geq 0.$$

This random process is called the Wiener random process and is often used as a
model for Brownian motion. It is the continuous-time equivalent of the discrete-time
random process of Example 20.3. A typical realization of the Wiener random
process is shown in Figure 20.6 (see Problem 20.18 on how this was done).

Figure 20.6: Typical realization of the Wiener random process.

Note that because of its construction as the “sum” of independent and identically distributed
random variables (the U(t)’s), the increments are also independent and stationary.
To prove that X(t) is a Gaussian random process is somewhat tricky in that it is
an uncountable “sum” of independent random variables $U(\xi)$ for $0 \leq \xi \leq t$. We will
take it on faith that any integral, which is a linear transformation, of a continuous-time
Gaussian random process produces another continuous-time Gaussian random
process (see also Problem 20.16 for a heuristic proof). As such, we need only determine
the mean and covariance functions. These are found as


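$$\begin{aligned}
E[X(t)] &= E\left[\int_0^t U(\xi)\,d\xi\right] = \int_0^t E[U(\xi)]\,d\xi = 0 \\
E[X(t_1)X(t_2)] &= E\left[\int_0^{t_1} U(\xi_1)\,d\xi_1 \int_0^{t_2} U(\xi_2)\,d\xi_2\right] \\
&= \int_0^{t_1}\int_0^{t_2} \underbrace{E[U(\xi_1)U(\xi_2)]}_{r_U(\xi_2 - \xi_1) = (N_0/2)\delta(\xi_2 - \xi_1)}\,d\xi_2\,d\xi_1 \\
&= \frac{N_0}{2}\int_0^{t_1}\left(\int_0^{t_2} \delta(\xi_2 - \xi_1)\,d\xi_2\right)d\xi_1.
\end{aligned}$$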

To evaluate the double integral we first examine the inner integral and assume that
$t_2 > t_1$. Then, the function $\delta(\xi_2 - \xi_1)$ with $\xi_1$ fixed is integrated over the interval
$0 \leq \xi_2 \leq t_2$ as shown in Figure 20.7.

Figure 20.7: Evaluation of double integral of Dirac delta function for the case of
$t_2 > t_1$. (The inner integral is taken first, along the $\xi_2$ direction.)

It is clear from the figure that if we fix $\xi_1$
and integrate along $\xi_2$, then we will include the impulse in the inner integral for all
$\xi_1$. (This would not be the case if $t_2 < t_1$ as one can easily verify by redrawing the
rectangle for this condition.) As a result, if $t_2 > t_1$, then

$$\int_0^{t_2} \delta(\xi_2 - \xi_1)\,d\xi_2 = 1 \qquad \text{for all } 0 \leq \xi_1 \leq t_1$$

and therefore

$$E[X(t_1)X(t_2)] = \frac{N_0}{2}\int_0^{t_1} d\xi_1 = \frac{N_0}{2}t_1$$

and similarly if $t_2 < t_1$, we will have $E[X(t_1)X(t_2)] = (N_0/2)t_2$. Combining the two
results produces

$$E[X(t_1)X(t_2)] = \frac{N_0}{2}\min(t_1, t_2) \qquad (20.16)$$

which should be compared to the discrete-time result obtained in Problem 16.26.
Hence, the joint PDF of the samples of a Wiener random process is a multivariate
Gaussian PDF with mean vector equal to zero and covariance matrix having as its
(i, j)th element

$$[\mathbf{C}]_{ij} = \frac{N_0}{2}\min(t_i, t_j).$$

Note that from (20.16) with $t_1 = t_2 = t$, the PDF of X(t) is $\mathcal{N}(0, (N_0/2)t)$. Clearly,
the Wiener random process is a nonstationary correlated random process whose mean
is zero, variance increases with time, and marginal PDF is Gaussian.

❖
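
The result (20.16) can be illustrated with a discrete-time approximation of the Wiener random process. The sketch below is only one possible discretization and is not claimed to be the method of Problem 20.18: we assume that the integral of WGN over a short step of length `step` can be replaced by an independent $\mathcal{N}(0, (N_0/2)\,\text{step})$ increment, and the function names, step size, seed, and number of paths are all our own choices.

```python
import math
import random


def wiener_samples(n0, times, step, rng):
    """Approximate Wiener samples X(t) at the given times (returned in sorted order).

    Assumed discretization: X grows by independent N(0, (n0/2)*step) increments.
    """
    sigma = math.sqrt(0.5 * n0 * step)
    x, t, out = 0.0, 0.0, []
    for target in sorted(times):
        while t < target - 1e-12:  # small tolerance guards against rounding in t
            x += rng.gauss(0.0, sigma)
            t += step
        out.append(x)
    return out


def estimate_correlation(n0, t1, t2, step=0.02, num_paths=20_000, seed=2):
    """Ensemble-average estimate of E[X(t1) X(t2)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_paths):
        x_a, x_b = wiener_samples(n0, [t1, t2], step, rng)
        total += x_a * x_b
    return total / num_paths


n0 = 2.0  # so that N0/2 = 1 and (20.16) predicts min(t1, t2)
corr_half_one = estimate_correlation(n0, 0.5, 1.0)
var_one = estimate_correlation(n0, 1.0, 1.0)
print("E[X(0.5)X(1)]:", round(corr_half_one, 3), "predicted", 0.5)
print("E[X(1)^2]:    ", round(var_one, 3), "predicted", 1.0)
```

With $N_0/2 = 1$ the estimates should lie close to $\min(t_1, t_2)$, and the second value shows the variance growing linearly with time.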

In the next section we explore some other important continuous-time Gaussian random
processes often used as models in practice.


20.7  Special  Continuous-Time  Gaussian  Random  Processes

20.7.1  Rayleigh  Fading  Sinusoid 

In Example 16.11 we studied a discrete-time randomly phased sinusoid. Here we
consider the continuous-time equivalent for that random process, which is given by
$X(t) = A\cos(2\pi F_0 t + \Theta)$, where A > 0 is the amplitude, $F_0$ is the frequency in Hz,
and $\Theta$ is the random phase with PDF $\mathcal{U}(0, 2\pi)$. We now further assume that the
amplitude is also a random variable. This is frequently a good model for a sinusoidal
signal that is subject to multipath fading. It occurs when a sinusoidal signal
propagates through a medium, e.g., an electromagnetic pulse in the atmosphere or a
sound pulse in the ocean, and reaches its destination by several different paths. The
constructive and destructive interference of several overlapping sinusoids causes the
received waveform to exhibit amplitude fluctuations or fading. An example of this
was given in Figure 20.2. However, over any short period of time, say $5 \leq t \leq 5.5$
seconds, the waveform will have approximately a constant amplitude and a constant
phase as shown in Figure 20.8.

Figure 20.8: Segment of waveform shown in Figure 20.2 for $5 \leq t \leq 5.5$ seconds.

Because the amplitude and phase are not known in
advance, we model them as realizations of random variables. That the waveform
does not maintain the constant amplitude level and phase outside of the small interval
will be of no consequence to us if we are only privy to observing the waveform
over a small time interval. Hence, a reasonable model for the random process (over
the small time interval) is to assume a random amplitude and random phase so that
$X(t) = A\cos(2\pi F_0 t + \Theta)$, where A and $\Theta$ are random variables. A more convenient
form is obtained by expanding the sinusoid as


$$\begin{aligned}
X(t) &= A\cos(2\pi F_0 t + \Theta) \\
&= A\cos(\Theta)\cos(2\pi F_0 t) - A\sin(\Theta)\sin(2\pi F_0 t) \\
&= U\cos(2\pi F_0 t) - V\sin(2\pi F_0 t)
\end{aligned}$$


where we have let $A\cos(\Theta) = U$, $A\sin(\Theta) = V$. Clearly, since A and $\Theta$ are random
variables, so are U and V. Since the physical waveform is due to the sum of many
sinusoids, we once again use a central limit theorem argument to assume that U and
V are Gaussian. Furthermore, if we assume that they are independent and have the
same PDF of $\mathcal{N}(0, \sigma^2)$, we will obtain PDFs for the amplitude and phase which are
found to be valid in practice. With the Gaussian assumptions for U and V, the
random amplitude becomes a Rayleigh distributed random variable, the random
phase becomes a uniformly distributed random variable, and the amplitude and
phase random variables are independent of each other. To see this note that since
$U = A\cos(\Theta)$, $V = A\sin(\Theta)$, we have $A = \sqrt{U^2 + V^2}$ and $\Theta = \arctan(V/U)$. It
was shown in Example 12.12 that if $X \sim \mathcal{N}(0, \sigma^2)$, $Y \sim \mathcal{N}(0, \sigma^2)$, and X and Y are
independent, then $R = \sqrt{X^2 + Y^2}$ is a Rayleigh random variable, $\Theta = \arctan(Y/X)$
is a uniformly distributed random variable, and R and $\Theta$ are independent. Hence, we
have that for the random amplitude/random phase sinusoid $X(t) = A\cos(2\pi F_0 t + \Theta)$,
the amplitude has the PDF

$$p_A(a) = \begin{cases} \dfrac{a}{\sigma^2}\exp\left(-\dfrac{a^2}{2\sigma^2}\right) & a \geq 0 \\ 0 & a < 0 \end{cases}$$

and the phase has the PDF $\Theta \sim \mathcal{U}(0, 2\pi)$, and A and $\Theta$ are independent. This
model is usually referred to as the Rayleigh fading sinusoidal model. It is also a
Gaussian random process since all sets of K samples can be written as


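$$\begin{bmatrix} X(t_1) \\ X(t_2) \\ \vdots \\ X(t_K) \end{bmatrix} =
\begin{bmatrix}
\cos(2\pi F_0 t_1) & -\sin(2\pi F_0 t_1) \\
\cos(2\pi F_0 t_2) & -\sin(2\pi F_0 t_2) \\
\vdots & \vdots \\
\cos(2\pi F_0 t_K) & -\sin(2\pi F_0 t_K)
\end{bmatrix}
\begin{bmatrix} U \\ V \end{bmatrix}$$

which is a linear transformation of the Gaussian random vector $[U\ V]^T$, and so has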
a  multivariate  Gaussian  PDF.  (For  K  >  2  the  covariance  matrix  will  be  singular, 
so  that  to  be  more  rigorous  we  would  need  to  modify  our  definition  of  the  Gaussian 
random  process.  This  would  involve  the  characteristic  function  which  exists  even 
for  a  singular  covariance  matrix.)  Furthermore,  X(t)  is  WSS,  as  we  now  show.  Its 
mean  is  zero  since  E[U]  =  E[V]  =  0  and  its  ACF  is 

$$\begin{aligned}
r_X(\tau) &= E[X(t)X(t + \tau)] \\
&= E[[U\cos(2\pi F_0 t) - V\sin(2\pi F_0 t)][U\cos(2\pi F_0(t + \tau)) - V\sin(2\pi F_0(t + \tau))]] \\
&= E[U^2]\cos(2\pi F_0 t)\cos(2\pi F_0(t + \tau)) + E[V^2]\sin(2\pi F_0 t)\sin(2\pi F_0(t + \tau)) \\
&= \sigma^2[\cos(2\pi F_0 t)\cos(2\pi F_0(t + \tau)) + \sin(2\pi F_0 t)\sin(2\pi F_0(t + \tau))] \\
&= \sigma^2\cos(2\pi F_0 \tau) \qquad (20.17)
\end{aligned}$$

where we have used E[UV] = E[U]E[V] = 0 due to independence. Its PSD is
obtained by taking the Fourier transform to yield

$$P_X(F) = \frac{\sigma^2}{2}\delta(F + F_0) + \frac{\sigma^2}{2}\delta(F - F_0) \qquad (20.18)$$

and it is seen that all its power is concentrated at $F = \pm F_0$ as expected.
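
A short simulation can make the Rayleigh fading sinusoid model concrete. The sketch below is our own illustration; the function name, parameter values, seed, and number of trials are assumptions. It draws independent $U, V \sim \mathcal{N}(0, \sigma^2)$, forms the amplitude $\sqrt{U^2 + V^2}$, and compares its ensemble average with the Rayleigh mean $\sigma\sqrt{\pi/2}$. It also estimates $E[X(t)X(t + \tau)]$ for comparison with $\sigma^2\cos(2\pi F_0 \tau)$ from (20.17).

```python
import math
import random


def rayleigh_fading_statistics(sigma, f0, t, tau, num_trials=100_000, seed=3):
    """Return (average amplitude, estimate of E[X(t) X(t + tau)])."""
    rng = random.Random(seed)
    amplitude_sum = 0.0
    acf_sum = 0.0
    for _ in range(num_trials):
        u = rng.gauss(0.0, sigma)
        v = rng.gauss(0.0, sigma)
        amplitude_sum += math.hypot(u, v)  # A = sqrt(U^2 + V^2)
        x_t = u * math.cos(2 * math.pi * f0 * t) - v * math.sin(2 * math.pi * f0 * t)
        x_t_tau = (u * math.cos(2 * math.pi * f0 * (t + tau))
                   - v * math.sin(2 * math.pi * f0 * (t + tau)))
        acf_sum += x_t * x_t_tau
    return amplitude_sum / num_trials, acf_sum / num_trials


sigma, f0, t, tau = 1.0, 1.0, 0.3, 0.1
mean_amplitude, acf_estimate = rayleigh_fading_statistics(sigma, f0, t, tau)
print("mean amplitude:", round(mean_amplitude, 3),
      "predicted", round(sigma * math.sqrt(math.pi / 2), 3))
print("ACF estimate:  ", round(acf_estimate, 3),
      "predicted", round(sigma ** 2 * math.cos(2 * math.pi * f0 * tau), 3))
```

Changing `t` while holding `tau` fixed should leave the ACF estimate essentially unchanged, which is the WSS property shown above.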

20.7.2  Bandpass  Random  Process 

The Rayleigh fading sinusoid model assumed that our observation time was short.
Within that time window, the sinusoid exhibits approximately constant amplitude
and phase. If we observe a longer time segment of the random process whose typical
realization is shown in Figure 20.2, then the constant in time (but random
amplitude/random phase) sinusoid is not a good model. A more realistic but more
complicated model is to let both the amplitude and phase be random processes so
that they vary in time. As such, the random process will be made up of many frequencies,
although they will be concentrated about $F = F_0$. Such a random process
is usually called a narrowband random process. Our model, however, will actually
be valid for a bandpass random process whose PSD is shown in Figure 20.9.

Figure 20.9: Typical PSD for bandpass random process. The PSD is assumed to be
symmetric about $F = F_0$ and also that $F_0 > W/2$. (Axes: $P_X(F)$ versus F, with band
edges at $F_0 - W/2$ and $F_0 + W/2$.)

Hence, we will assume that the bandpass random process can be represented as

$$\begin{aligned}
X(t) &= A(t)\cos(2\pi F_0 t + \Theta(t)) \\
&= A(t)\cos(\Theta(t))\cos(2\pi F_0 t) - A(t)\sin(\Theta(t))\sin(2\pi F_0 t)
\end{aligned}$$

where A(t) and $\Theta(t)$ are now random processes. As before we let

$$\begin{aligned}
U(t) &= A(t)\cos(\Theta(t)) \\
V(t) &= A(t)\sin(\Theta(t))
\end{aligned}$$

so that we have as our model for a bandpass random process

$$X(t) = U(t)\cos(2\pi F_0 t) - V(t)\sin(2\pi F_0 t). \qquad (20.19)$$

The X(t) random process is seen to be a modulated version of U(t) and V(t) (modulation
meaning that U(t) and V(t) are multiplied by $\cos(2\pi F_0 t)$ and $\sin(2\pi F_0 t)$,
respectively). This modulation shifts the PSD of U(t) and V(t) to be centered about
$F = F_0$. Therefore, U(t) and V(t) must be slowly varying or lowpass random processes.
As a suitable description of U(t) and V(t) we assume that they are each zero
mean lowpass Gaussian random processes, independent of each other, and jointly
WSS (see Chapter 19) with the same ACF, $r_U(\tau) = r_V(\tau)$. Then, as before X(t) is
a zero mean Gaussian random process, which as we now show is also WSS. Clearly,
since both U(t) and V(t) are zero mean, from (20.19) so is X(t), and the ACF is

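$$\begin{aligned}
r_X(\tau) &= E[X(t)X(t + \tau)] \qquad (20.20) \\
&= E[[U(t)\cos(2\pi F_0 t) - V(t)\sin(2\pi F_0 t)] \\
&\qquad \cdot [U(t + \tau)\cos(2\pi F_0(t + \tau)) - V(t + \tau)\sin(2\pi F_0(t + \tau))]] \\
&= r_U(\tau)\cos(2\pi F_0 t)\cos(2\pi F_0(t + \tau)) + r_V(\tau)\sin(2\pi F_0 t)\sin(2\pi F_0(t + \tau)) \\
&= r_U(\tau)\cos(2\pi F_0 \tau) \qquad (20.21)
\end{aligned}$$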

since $E[U(t_1)V(t_2)] = 0$ for all $t_1$ and $t_2$ due to the independence assumption, and
$r_U(\tau) = r_V(\tau)$ by assumption. Note that this extends the previous case in which
U(t) = U and V(t) = V and $r_U(\tau) = \sigma^2$ (see (20.17)). The PSD is found by taking
the Fourier transform of the ACF to yield

$$P_X(F) = \frac{1}{2}P_U(F + F_0) + \frac{1}{2}P_U(F - F_0). \qquad (20.22)$$

If U(t) and V(t) have the lowpass PSD shown in Figure 20.10, then in accordance
with (20.22) $P_X(F)$ is given by the dashed curve.

Figure 20.10: PSD for lowpass random processes U(t) and V(t), with $P_U(F) = P_V(F)$. The PSD for the
bandpass random process X(t) is shown as the dashed curve.

As desired, we now have a representation
for a bandpass random process. It is obtained by modulating two lowpass
random processes U(t) and V(t) up to a center frequency of $F_0$ Hz. Hence, (20.19)
is called the bandpass random process representation and since the random process
may either represent a signal or noise, it is also referred to as the bandpass signal
representation or the bandpass noise representation. Note that because $P_U(F)$ is
symmetric about F = 0, $P_X(F)$ must be symmetric about $F = F_0$. To represent
bandpass PSDs that are not symmetric requires the assumption that U(t) and V(t)
are correlated [Van Trees 1971].

In summary, to model a WSS Gaussian random process X(t) that has a zero
mean and a bandpass PSD given by

$$P_X(F) = \frac{1}{2}P_U(F + F_0) + \frac{1}{2}P_U(F - F_0)$$

where $P_U(F) = 0$ for $|F| > W/2$ as shown in Figure 20.10 by the dashed curve, we
use

$$X(t) = U(t)\cos(2\pi F_0 t) - V(t)\sin(2\pi F_0 t).$$

The assumptions are that U(t), V(t) are each Gaussian random processes with zero
mean, independent of each other and each is WSS with PSD $P_U(F)$. The random
processes U(t), V(t) are lowpass random processes and are sometimes referred to
as the in-phase and quadrature components of X(t). This is because the “carrier”
sinusoid $\cos(2\pi F_0 t)$ is in phase with the sinusoidal carrier in $U(t)\cos(2\pi F_0 t)$ and 90°
out of phase with the sinusoidal carrier in $V(t)\sin(2\pi F_0 t)$. See Problem 20.24 on how
to extract the lowpass random processes from X(t). In addition, the amplitude of
X(t), which is $\sqrt{U^2(t) + V^2(t)}$, is called the envelope of X(t). This is because if X(t)
is written as $X(t) = \sqrt{U^2(t) + V^2(t)}\cos(2\pi F_0 t + \arctan(V(t)/U(t)))$ (see Problem
12.42) the envelope consists of the maximums of the waveform. An example of a
deterministic bandpass signal, given for sake of illustration, is $s(t) = 3t\cos(2\pi 20t) - 4t\sin(2\pi 20t)$
for $0 \leq t \leq 1$, and is shown in Figure 20.11. Note that the envelope
is $\sqrt{(3t)^2 + (4t)^2} = 5|t|$.

Figure 20.11: Plot of the deterministic bandpass signal $s(t) = 3t\cos(2\pi 20t) - 4t\sin(2\pi 20t)$
for $0 \leq t \leq 1$. The envelope is shown as the dashed line.

For a bandpass random process the envelope will also be a
random process. Since U(t) and V(t) both have the same ACF, the characteristics of
the envelope depend directly on $r_U(\tau)$. An illustration is given in the next example.

Example  20.8  -  Bandpass  random  process  envelope 

Consider the bandpass Gaussian random process whose PSD is shown in Figure
20.12. This is often used as a model for bandpass “white” Gaussian noise. It results
from having filtered WGN with a bandpass filter.

Figure 20.12: PSD for bandpass “white” Gaussian noise. (Axes: $P_X(F)$ versus F, with
band edges at $F_0 - W/2$ and $F_0 + W/2$.)

Note that from (20.22) the PSD of U(t) and V(t) must be

$$P_U(F) = P_V(F) = \begin{cases} N_0 & |F| \leq W/2 \\ 0 & |F| > W/2 \end{cases}$$

and therefore by taking the inverse Fourier transform, the ACF becomes (see also
(17.55) for a similar calculation)

$$r_U(\tau) = r_V(\tau) = N_0 W\,\frac{\sin(\pi W \tau)}{\pi W \tau}. \qquad (20.23)$$

The correlation between two samples of the envelope will be approximately zero when
$\tau > 1/W$ since then $r_U(\tau) = r_V(\tau) \approx 0$. Examples of some bandpass realizations
are shown for $F_0$ = 20 Hz, W = 1 Hz in Figure 20.13a and for $F_0$ = 20 Hz, W = 4
Hz in Figure 20.13b. The time for which two samples must be separated before they
become uncorrelated is called the correlation time $\tau_c$. It is defined by $r_X(\tau) \approx 0$ for
$\tau > \tau_c$. Here it is $\tau_c \approx 1/W$, and is shown in Figure 20.13.

Figure 20.13: Typical realizations of bandpass “white” Gaussian noise, plotted against
t (sec). (a) $F_0$ = 20 Hz, W = 1 Hz. (b) $F_0$ = 20 Hz, W = 4 Hz. The PSD is
given in Figure 20.12.

A typical probability calculation might be to determine the probability that the
envelope at $t = t_0$ exceeds some threshold $\gamma$. Thus, we wish to find $P[A(t_0) > \gamma]$,
where $A(t_0) = \sqrt{U^2(t_0) + V^2(t_0)}$. Since U(t) and V(t) are independent Gaussian
random processes with $U(t) \sim \mathcal{N}(0, \sigma^2)$ and $V(t) \sim \mathcal{N}(0, \sigma^2)$, it follows that $A(t_0)$
is a Rayleigh random variable. Hence, we have that

$$P[A(t_0) > \gamma] = \int_\gamma^\infty \frac{a}{\sigma^2}\exp\left(-\frac{a^2}{2\sigma^2}\right)da = \exp\left(-\frac{\gamma^2}{2\sigma^2}\right).$$

To complete the calculation we need to determine $\sigma^2$. But $\sigma^2 = E[U^2(t_0)] = r_U(0) = N_0 W$
from (20.23). Therefore, we have that

$$P[A(t_0) > \gamma] = \exp\left(-\frac{\gamma^2}{2N_0 W}\right).$$
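
The envelope exceedance probability is simple enough to evaluate directly, and a Monte Carlo run gives a sanity check. The following sketch is our own illustration; the function names, the chosen values of $N_0$, W, and $\gamma$, the seed, and the number of trials are assumptions. It uses only the facts stated above: $U(t_0)$ and $V(t_0)$ are independent $\mathcal{N}(0, N_0 W)$ random variables and the envelope is $\sqrt{U^2(t_0) + V^2(t_0)}$.

```python
import math
import random


def envelope_exceedance_probability(threshold, n0, bandwidth):
    """P[A(t0) > threshold] = exp(-threshold^2 / (2 N0 W)) for a Rayleigh envelope."""
    sigma2 = n0 * bandwidth  # sigma^2 = r_U(0) = N0 W from (20.23)
    return math.exp(-threshold ** 2 / (2.0 * sigma2))


def simulate_exceedance(threshold, n0, bandwidth, num_trials=100_000, seed=4):
    """Fraction of simulated envelopes that exceed the threshold."""
    rng = random.Random(seed)
    sigma = math.sqrt(n0 * bandwidth)
    hits = 0
    for _ in range(num_trials):
        envelope = math.hypot(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        if envelope > threshold:
            hits += 1
    return hits / num_trials


n0, bandwidth = 0.5, 4.0             # illustrative values, so sigma^2 = 2
gamma = math.sqrt(2.0 * n0 * bandwidth)  # threshold chosen to give exp(-1)
exact = envelope_exceedance_probability(gamma, n0, bandwidth)
simulated = simulate_exceedance(gamma, n0, bandwidth)
print("formula:", round(exact, 4), "simulation:", round(simulated, 4))
```

Because the threshold was chosen as $\gamma = \sqrt{2N_0 W}$, the formula gives $e^{-1}$, and the simulated fraction should fall close to that value.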


20.8  Computer  Simulation 

We now discuss the generation of a realization of a discrete-time Gaussian random
process. The generation of a continuous-time random process realization can
be accomplished by approximating it by a discrete-time realization with a sufficiently
small time interval between samples. We have done this to produce Figure
20.6 (see also Problem 20.18). In particular, we wish to generate a realization
of a WSS Gaussian random process with mean zero and ACS $r_X[k]$ or equivalently
a PSD $P_X(f)$. For nonzero mean random processes we need only add the
mean to the realization. The method is based on Theorem 20.4.1, where we use a
WGN random process U[n] as the input to an LSI filter with frequency response
H(f). Then, we know that the output random process will be WSS and Gaussian
with a PSD $P_X(f) = |H(f)|^2 P_U(f)$. Now assuming that $P_U(f) = \sigma_U^2 = 1$,
so that $P_X(f) = |H(f)|^2$, we see that a filter whose frequency response magnitude
is $|H(f)| = \sqrt{P_X(f)}$ and whose phase response is arbitrary (but must be
an odd function) will be required. Finding the filter frequency response from the
PSD is known as spectral factorization [Priestley 1981]. As special cases of this
problem, if we wish to generate either the AR, MA, or ARMA Gaussian random
processes described in Section 20.4, then the filters are already known and have
been implemented as difference equations. For example, the MA random process
is generated by filtering U[n] with the LSI filter whose frequency response
is $H(f) = 1 - b\exp(-j2\pi f)$. This is equivalent to the implementation using the
difference equation X[n] = U[n] − bU[n − 1]. For higher-order (more coefficients)
AR, MA, and ARMA random processes, the reader should consult [Kay 1988] for
how the appropriate coefficients can be obtained from the PSD. Also, note that the
problem of designing a filter whose frequency response magnitude approximates a
given one is called digital filter design. Many techniques are available to do this
[Jackson 1996]. We next give a simple example of how to generate a realization of
a WSS Gaussian random process with a given PSD.

Example  20.9  -  Filter  determination  to  produce  Gaussian  random  process  with  given  PSD

Assume we wish to generate a realization of a WSS Gaussian random process with
zero mean and PSD $P_X(f) = (1 + \cos(4\pi f))/2$. Then, for $P_U(f) = 1$ the magnitude
of the frequency response should be

$$|H(f)| = \sqrt{\tfrac{1}{2}(1 + \cos(4\pi f))}.$$

We will choose the phase response or $\angle H(f) = \theta(f)$ to be any convenient function.
Thus, we wish to determine the impulse response h[k] of the filter whose frequency
response is

$$H(f) = \sqrt{\tfrac{1}{2}(1 + \cos(4\pi f))}\,\exp(j\theta(f))$$

since then we can generate the random process using a convolution sum as

$$X[n] = \sum_{k=-\infty}^{\infty} h[k]U[n - k]. \qquad (20.24)$$

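The impulse response is found as the inverse Fourier transform of the frequency
response

$$\begin{aligned}
h[n] &= \int_{-1/2}^{1/2} H(f)\exp(j2\pi f n)\,df \\
&= \int_{-1/2}^{1/2} \sqrt{\tfrac{1}{2}(1 + \cos(4\pi f))}\,\exp(j\theta(f))\exp(j2\pi f n)\,df \qquad -\infty < n < \infty.
\end{aligned}$$

This can be evaluated by noting that $\cos(2\alpha) = \cos^2(\alpha) - \sin^2(\alpha)$ and therefore

$$\begin{aligned}
\sqrt{\tfrac{1}{2}(1 + \cos(4\pi f))} &= \sqrt{\tfrac{1}{2}(1 + \cos^2(2\pi f) - \sin^2(2\pi f))} \\
&= \sqrt{\cos^2(2\pi f)} \\
&= |\cos(2\pi f)|.
\end{aligned}$$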

Thus,

$$h[n] = \int_{-1/2}^{1/2} |\cos(2\pi f)|\exp(j\theta(f))\exp(j2\pi f n)\,df$$

and we choose $\exp(j\theta(f)) = 1$ if $\cos(2\pi f) > 0$ and $\exp(j\theta(f)) = -1$ if $\cos(2\pi f) < 0$.
This produces

$$h[n] = \int_{-1/2}^{1/2} \cos(2\pi f)\exp(j2\pi f n)\,df$$

which is easily shown to evaluate to

$$h[n] = \begin{cases} \frac{1}{2} & n = \pm 1 \\ 0 & \text{otherwise.} \end{cases}$$

Hence, from (20.24) we have that

$$X[n] = \frac{1}{2}U[n + 1] + \frac{1}{2}U[n - 1].$$

Note that the filter is noncausal. We could also use $X[n] = \frac{1}{2}U[n] + \frac{1}{2}U[n - 2]$
if a causal filter is desired and still obtain the same PSD (see Problem 20.28).

❖ 
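
The filter of Example 20.9 can be checked both analytically and by simulation. The sketch below is our own illustration (the dictionary representation of the impulse response, the function names, the seed, and the record length are assumptions). It confirms that $|H(f)|^2$ matches the target PSD $(1 + \cos(4\pi f))/2$ for the noncausal and the causal filters, and then filters WGN with the noncausal filter via the convolution sum (20.24) to estimate the ACS, which should be 1/2 at lag 0, 1/4 at lag 2, and zero at lags 1 and 3.

```python
import cmath
import math
import random

H_NONCAUSAL = {-1: 0.5, 1: 0.5}  # h[n] = 1/2 for n = +-1
H_CAUSAL = {0: 0.5, 2: 0.5}      # delayed version with the same magnitude response


def psd_from_impulse_response(h, f):
    """|H(f)|^2 for unit-variance WGN input, with H(f) = sum_n h[n] exp(-j 2 pi f n)."""
    H = sum(c * cmath.exp(-2j * math.pi * f * n) for n, c in h.items())
    return abs(H) ** 2


def target_psd(f):
    return (1.0 + math.cos(4.0 * math.pi * f)) / 2.0


def generate_realization(h, num_samples, seed=5):
    """X[n] = sum_k h[k] U[n - k] for n = 0, ..., num_samples - 1."""
    rng = random.Random(seed)
    k_min, k_max = min(h), max(h)
    # u[j] stores U[j - k_max], so that U[n - k] is u[n - k + k_max].
    u = [rng.gauss(0.0, 1.0) for _ in range(num_samples + k_max - k_min)]
    return [sum(c * u[n - k + k_max] for k, c in h.items())
            for n in range(num_samples)]


def estimate_acs(x, lag):
    count = len(x) - lag
    return sum(x[n] * x[n + lag] for n in range(count)) / count


x = generate_realization(H_NONCAUSAL, 100_000)
for lag in range(4):
    print("lag", lag, "ACS estimate", round(estimate_acs(x, lag), 3))
```

The two filters differ only by a two-sample delay, which changes the phase response but not $|H(f)|$, so either produces the desired PSD.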

Finally, it should be pointed out that an alternative means of generating successive
samples of a zero mean Gaussian WSS random process is by applying a matrix
transformation to a vector of independent $\mathcal{N}(0, 1)$ samples. If a realization of $\mathbf{X} =
[X[0]\ X[1] \ldots X[N - 1]]^T$, where $\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{R}_X)$ and $\mathbf{R}_X$ is the N × N Toeplitz
autocorrelation matrix given in (17.23) is desired, then the method described in
Section 14.9 can be used. We need only replace $\mathbf{C}$ by $\mathbf{R}_X$. For a nonzero mean WSS
Gaussian random process, we add the mean $\mu$ to each sample after this procedure
is employed. The only drawback is that the realization is assumed to consist of
a fixed number of samples N, and so for each value of N the procedure must be
repeated. Filtering, as previously described, allows any number of samples to be
easily generated.
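
The matrix transformation approach can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of Section 14.9: we assume a lower triangular Cholesky factor $\mathbf{G}$ with $\mathbf{G}\mathbf{G}^T = \mathbf{R}_X$ is an acceptable choice of transformation, and the function names, the example ACS (that of the MA process of Example 20.5), and the seed are our own. A vector of independent $\mathcal{N}(0, 1)$ samples $\mathbf{w}$ is then mapped to $\mathbf{x} = \mathbf{G}\mathbf{w}$, whose covariance matrix is $\mathbf{R}_X$.

```python
import math
import random


def toeplitz_autocorrelation_matrix(acs):
    """Build R with [R]_ij = r[|i - j|] from the ACS values r[0], ..., r[N-1]."""
    n = len(acs)
    return [[acs[abs(i - j)] for j in range(n)] for i in range(n)]


def cholesky_lower(R):
    """Return lower triangular G with G G^T = R (R must be positive definite)."""
    n = len(R)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(G[i][k] * G[j][k] for k in range(j))
            if i == j:
                G[i][j] = math.sqrt(R[i][i] - partial)
            else:
                G[i][j] = (R[i][j] - partial) / G[j][j]
    return G


def gaussian_realization(G, rng):
    """One realization x = G w, where w holds independent N(0, 1) samples."""
    w = [rng.gauss(0.0, 1.0) for _ in G]
    return [sum(G[i][k] * w[k] for k in range(i + 1)) for i in range(len(G))]


acs = [0.5, 0.25, 0.0, 0.0]  # r_X[k] of the MA process in Example 20.5, N = 4
R = toeplitz_autocorrelation_matrix(acs)
G = cholesky_lower(R)
rng = random.Random(6)
print(gaussian_realization(G, rng))
```

As noted above, the factorization depends on N, so a longer realization requires building and factoring a new matrix.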

20.9  Real-World  Example  -  Estimating  Fish  Populations

Of concern to biologists, and to us all, is the fish population. Traditionally, the
population has been estimated using a count produced by a net catch. However,
this is expensive, time consuming, and relatively inaccurate. A better approach
is therefore needed. In the introduction we briefly indicated how an echo sonar
would produce a Gaussian random process as the reflected waveform from a school
of fish. We now examine this in more detail and explain how estimation of the fish
population might be done. The discussion is oversimplified; the interested
reader may consult [Ehrenberg and Lytle 1972, Stanton 1983, Stanton and Chu
1998] for more detail. Referring to Figure 20.14 a sound pulse, which is assumed
to be sinusoidal, is transmitted from a ship. As it encounters a school of fish, it
will be reflected from each fish and the entire waveform, which is the sum of all the
reflections, will be received at the ship. The received waveform will be examined
for the time interval from $t = 2R_{\min}/c$ to $t = 2R_{\max}/c$, where $R_{\min}$ and $R_{\max}$ are
the minimum and maximum ranges of interest, respectively, and c is the speed of
sound in the water. This corresponds to the time interval over which the reflections
from the desired ranges will be present. Based on the received waveform we wish
to estimate the number of fish in the vertical direction in the desired range window
from $R_{\min}$ to $R_{\max}$. Note that only the fish within the nearly vertical dashed lines,
which indicate the width of the transmitted sound energy, will produce reflections.
For  different  angular  regions  other  pulses  must  be  transmitted.  As  discussed  in  the 
introduction,  if  there  are  a  large  number  of  fish  producing  reflections,  then  by  the 
central  limit  theorem,  the  received  waveform  can  be  modeled  as  a  Gaussian  random 
process.  As  shown  in  Figure  20.14  the  sinusoidal  pulse  first  encounters  the  fish 
nearest  in  range,  producing  a  reflection,  while  the  fish  farthest  in  range  produces 
the last reflection. As a result, the many reflected pulses will overlap in time, with
two of the reflected pulses shown in the figure.

Figure 20.14: Fish counting by echo sonar.

Hence, each reflected pulse can be represented as

$$X_i(t) = A_i \cos(2\pi F_0 (t - \tau_i) + \Theta_i) \qquad (20.25)$$

where $F_0$ is the transmit frequency in Hz and $\tau_i = 2R_i/c$ is the time delay of the
pulse reflected from the $i$th fish. As explained in the introduction, since $A_i$, $\Theta_i$ will
depend upon the fish's position, orientation, and motion, which are not known a
priori, we assume that they are realizations of random variables. Furthermore, since
the ranges of the individual fish are unknown, we also do not know $\tau_i$. Hence, we
replace (20.25) by

$$X_i(t) = A_i \cos(2\pi F_0 t + \Theta_i')$$

where $\Theta_i' = \Theta_i - 2\pi F_0 \tau_i$ (which is reduced by multiples of $2\pi$ until it lies within the
interval $(0, 2\pi)$), and model $\Theta_i'$ as a new random variable. Hence, for $N$ reflections
we have as our model

$$X(t) = \sum_{i=1}^{N} X_i(t) = \sum_{i=1}^{N} A_i \cos(2\pi F_0 t + \Theta_i')$$

and letting $U_i = A_i \cos(\Theta_i')$ and $V_i = A_i \sin(\Theta_i')$, we have

$$\begin{aligned}
X(t) &= \sum_{i=1}^{N} \left( U_i \cos(2\pi F_0 t) - V_i \sin(2\pi F_0 t) \right) \\
     &= \left( \sum_{i=1}^{N} U_i \right) \cos(2\pi F_0 t) - \left( \sum_{i=1}^{N} V_i \right) \sin(2\pi F_0 t) \\
     &= U \cos(2\pi F_0 t) - V \sin(2\pi F_0 t)
\end{aligned}$$

where $U = \sum_{i=1}^{N} U_i$ and $V = \sum_{i=1}^{N} V_i$. We assume that all the fish are about the
same size and hence the echo amplitudes are about the same. Then, since $U$ and $V$
are  the  sums  of  random  variables  that  we  assume  are  independent  (reflection  from 
one  fish  does  not  affect  reflection  from  any  of  the  others)  and  identically  distributed 
(fish  are  same  size),  we  use  a  central  limit  theorem  argument  to  postulate  a  Gaussian 
PDF  for  U  and  V.  We  furthermore  assume  that  U  and  V  are  independent  so  that  if 
$E[U_i] = E[V_i] = 0$ and $\mathrm{var}(U_i) = \mathrm{var}(V_i) = \sigma^2$, then we have that $U \sim \mathcal{N}(0, N\sigma^2)$,
$V \sim \mathcal{N}(0, N\sigma^2)$, and $U$ and $V$ are independent. This is the Rayleigh fading sinusoid
model discussed in Section 20.7. As a result, the envelope of the received waveform
$X(t)$, which is given by $A = \sqrt{U^2 + V^2}$, has a Rayleigh PDF. Specifically, it is

$$p_A(a) = \begin{cases} \dfrac{a}{N\sigma^2} \exp\left( -\dfrac{a^2}{2N\sigma^2} \right) & a \ge 0 \\ 0 & a < 0. \end{cases}$$

Hence, if we have previously measured the reflection characteristics of a single fish,
then we will know $\sigma^2$. To estimate $N$ we recall that the mean of the Rayleigh
random variable is

$$E[A] = \sqrt{\frac{\pi N \sigma^2}{2}}$$

so that upon solving for $N$, we have

$$N = \frac{2}{\pi \sigma^2} E^2[A].$$

To estimate the mean we can transmit a series of $M$ pulses and measure the envelope
for each received waveform $X_m(t)$ for $m = 1, 2, \ldots, M$. Calling the envelope
measurement for the $m$th pulse $\hat{A}_m$, we can form the estimator for the number of
fish as

$$\hat{N} = \frac{2}{\pi \sigma^2} \left( \frac{1}{M} \sum_{m=1}^{M} \hat{A}_m \right)^2. \qquad (20.26)$$

See Problem 20.20 on how to obtain $\hat{A}_m = \sqrt{U_m^2 + V_m^2}$ from $X_m(t)$. It is shown
there that $U_m = [2X_m(t) \cos(2\pi F_0 t)]_{\mathrm{LPF}}$ and $V_m = [-2X_m(t) \sin(2\pi F_0 t)]_{\mathrm{LPF}}$, where
the designation “LPF” indicates that the time waveform has been lowpass filtered.
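
The estimator of the number of fish can be checked with a short Monte Carlo sketch. The code below assumes the estimator takes the form $\hat{N} = (2/(\pi\sigma^2))$ times the squared sample mean of the envelopes, which follows from the Rayleigh mean $\sqrt{\pi N \sigma^2/2}$. The values of N, sigma2, and M and all variable names are illustrative assumptions, and U and V are drawn directly from their Gaussian PDFs rather than extracted from a simulated waveform.

```matlab
% Monte Carlo sketch of the fish-count estimator (illustrative values only)
randn('state',0);                  % seed the generator (legacy syntax, as in Appendix 20A)
N      = 50;                       % true number of fish (assumed for the experiment)
sigma2 = 0.5;                      % single-fish variance var(U_i) = var(V_i) (assumed known)
M      = 10000;                    % number of transmitted pulses (assumed)
U = sqrt(N*sigma2)*randn(M,1);     % U ~ N(0, N*sigma2), one value per pulse
V = sqrt(N*sigma2)*randn(M,1);     % V ~ N(0, N*sigma2), independent of U
A = sqrt(U.^2 + V.^2);             % Rayleigh envelope measurement for each pulse
Nhat = (2/(pi*sigma2))*mean(A)^2   % estimate of N from the sample mean envelope
```

With many pulses the estimate should land close to the assumed N; with only a few pulses it fluctuates noticeably, since the sample mean of the envelope is squared.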
Problems

20.1 (w) Determine the probability that 5 successive samples $\{X[0], X[1], X[2], X[3],
X[4]\}$ of discrete-time WGN with $\sigma_X^2 = 1$ will all exceed zero. Then, repeat
the problem if the samples are $\{X[10], X[11], X[12], X[13], X[14]\}$.

20.2 (^) (w) If $X[n]$ is the random process described in Example 20.2, find $P[X[0] >
0, X[3] > 0]$ if $\sigma_U^2 = 1$.

20.3 (w) If $X[n]$ is a discrete-time Wiener random process with $\mathrm{var}(X[n]) = 2(n +
1)$, determine $P[-3 < X[5] < 3]$.

20.4 (w) A discrete-time Wiener random process $X[n]$ is input to a differencer to
generate the output random process $Y[n] = X[n] - X[n-1]$. Describe the
characteristics of the output random process.

20.5 (o) (w) If discrete-time WGN $X[n]$ with $\sigma_X^2 = 1$ is input to a differencer to
generate the output random process $Y[n] = X[n] - X[n-1]$, find the PDF of
the samples $Y[0], Y[1]$. Are the samples independent?

20.6 (w) If in Example 20.4 the input random process to the differencer is an
AR random process with parameters $a$ and $\sigma_U^2 = 1$, determine the PDF of
$Y[0], Y[1]$. What happens as $a \to 1$? Explain your results.

20.7 (t) In this problem we argue that if $X[n]$ is a Gaussian random process
that is input to an LSI filter so that the output random process is $Y[n] =
\sum_{i=-\infty}^{\infty} h[i] X[n-i]$, then $Y[n]$ is also a Gaussian random process. To do
so consider a finite impulse response filter so that $Y[n] = \sum_{i=0}^{I} h[i] X[n-i]$
with $I = 4$ (the infinite impulse response filter argument is a bit more complicated
but is similar in nature) and choose to test the set of output samples
$n_1 = 0, n_2 = 1, n_3 = 2$ so that $K = 3$ (again the more general case proceeds
similarly). Now prove that the output samples have a 3-dimensional
Gaussian PDF. Hint: Show that the samples of $Y[n]$ are obtained as a linear
transformation of $X[n]$.

20.8 (w) A discrete-time WGN random process is input to an LSI filter with system
function $\mathcal{H}(z) = z - z^{-1}$. Determine the PDF of the output samples $Y[n]$ for
n  =  0,  —  1.  Are  any  of  these  samples  independent  of  each  other? 

20.9 (t) In this problem we prove that if $X[n]$ and $Y[n]$ are both Gaussian random
processes that are independent of each other, then $Z[n] = X[n] + Y[n]$ is also a
Gaussian random process. To do so we prove that the characteristic function
of $\mathbf{Z} = [Z[n_1]\; Z[n_2] \ldots Z[n_K]]^T$ is that of a Gaussian random vector. First note
that since $\mathbf{X} = [X[n_1]\; X[n_2] \ldots X[n_K]]^T$ and $\mathbf{Y} = [Y[n_1]\; Y[n_2] \ldots Y[n_K]]^T$ are
both Gaussian random vectors (by definition of a Gaussian random process),
then each one has the characteristic function

$$\phi(\boldsymbol{\omega}) = \exp\left( j\boldsymbol{\omega}^T \boldsymbol{\mu} - \tfrac{1}{2} \boldsymbol{\omega}^T \mathbf{C} \boldsymbol{\omega} \right)$$

where $\boldsymbol{\omega} = [\omega_1\; \omega_2 \ldots \omega_K]^T$. Next use the property that the characteristic function
of a sum of independent random vectors is the product of the characteristic
functions to show that $\mathbf{Z}$ has a $K$-dimensional Gaussian PDF.

20.10 (^) (w) Let $X[n]$ and $Y[n]$ be WSS Gaussian random processes with zero
mean and independent of each other. It is known that $Z[n] = X[n]Y[n]$ is not
a Gaussian random process. However, can we say that $Z[n]$ is a WSS random
process, and if so, what is its mean and PSD?

20.11  (w)  An  AR  random  process  is  described  by  X[n]  =  \X[n  -  1]  +  C/[n],  where 
$U[n]$ is WGN with $\sigma_U^2 = 1$. This random process is input to an LSI filter with
system  function  H(z)  =  1  —  \z~x  to  generate  the  output  random  process  Y[n\. 
Find $P[Y^2[0] + Y^2[1] > 1]$. Hint: Consider $X[n]$ as the output of an LSI filter.
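20.12 (t) We prove (20.11) in this problem by using the method of characteristic
functions. Recall that for a multivariate zero mean Gaussian PDF the
characteristic function is

$$\phi_{\mathbf{X}}(\boldsymbol{\omega}) = \exp\left( -\tfrac{1}{2} \boldsymbol{\omega}^T \mathbf{C} \boldsymbol{\omega} \right)$$

and the fourth-order moment can be found using (see Section 14.6)

$$E[X_1 X_2 X_3 X_4] = \left. \frac{\partial^4 \phi_{\mathbf{X}}(\boldsymbol{\omega})}{\partial\omega_1 \partial\omega_2 \partial\omega_3 \partial\omega_4} \right|_{\boldsymbol{\omega} = \mathbf{0}}.$$

Although straightforward, the algebra is tedious (see also Example 14.5 for the
second-order moment calculations). To avoid frustration (with P[frustration] =
1) note that

$$\boldsymbol{\omega}^T \mathbf{C} \boldsymbol{\omega} = \sum_{i=1}^{4} \sum_{j=1}^{4} \omega_i \omega_j E[X_i X_j]$$

and let $L_i = \sum_{j=1}^{4} \omega_j E[X_i X_j]$. Next show that

$$\frac{\partial \phi_{\mathbf{X}}(\boldsymbol{\omega})}{\partial \omega_k} = -\phi_{\mathbf{X}}(\boldsymbol{\omega}) L_k, \qquad \frac{\partial L_i}{\partial \omega_k} = E[X_i X_k]$$

and finally note that $L_i|_{\boldsymbol{\omega} = \mathbf{0}} = 0$ to avoid some algebra in the last differentiation.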


20.13 (w) It is desired to estimate $r_X[0]$ for $X[n]$ being WGN. If we use the estimator
$\hat{r}_X[0] = (1/N) \sum_{n=0}^{N-1} X^2[n]$, determine the mean and variance of $\hat{r}_X[0]$.
Hint: Use (20.13).

20.14 (^) (f) If $X[n] = U[n] + U[n-1]$, where $U[n]$ is a WGN random process
with $\sigma_U^2 = 1$, find $E[X[0]X[1]X[2]X[3]]$.

20.15 (f) Find the PSD of $X^2[n]$ if $X[n]$ is WGN with $\sigma_X^2 = 2$.

20.16 (t) To argue that the continuous-time Wiener random process is a Gaussian
random process, we replace $X(t) = \int_0^t U(\xi)\,d\xi$, where $U(t)$ is continuous-time
WGN, by the approximation

$$X(t) = \sum_{n=0}^{[t/\Delta t]} Z(n\Delta t)\Delta t$$

where $[x]$ indicates the largest integer less than or equal to $x$ and $Z(t)$ is a
zero mean WSS Gaussian random process. The PSD of $Z(t)$ is given by

$$P_Z(F) = \begin{cases} \dfrac{N_0}{2} & |F| \le W \\ 0 & |F| > W \end{cases}$$

where $W = 1/(2\Delta t)$. Explain why $X(t)$ is a Gaussian random process. Next
let $\Delta t \to 0$ and explain why $X(t)$ becomes a Wiener random process.

20.17 (o) (w) To extract $A$ from a realization of the random process $X(t) =
A + U(t)$, where $U(t)$ is WGN with PSD $P_U(F) = 1$ for all $F$, it is proposed
to use

$$\hat{A} = \frac{1}{T} \int_0^T X(\xi)\,d\xi.$$

How large should $T$ be chosen to ensure that $P[|\hat{A} - A| < 0.01] = 0.99$?

20.18  (w)  To  generate  a  realization  of  a  continuous-time  Wiener  random  process  on 
a  computer  we  must  replace  the  continuous-time  random  process  by  a  sampled 
approximation.  To  do  so  note  that  we  can  first  describe  the  Wiener  random 
process  by  breaking  up  the  integral  into  integrals  over  smaller  time  intervals. 
This  yields 


$$X(t) = \sum_{i=1}^{n} \underbrace{\int_{t_{i-1}}^{t_i} U(\xi)\,d\xi}_{X_i}$$

where $t_i = i\Delta t$ with $\Delta t$ very small, and $t_n = n\Delta t = t$. It is assumed that $t/\Delta t$
is an integer. Thus, the samples of $X(t)$ are conveniently found as

$$X(t_n) = X(n\Delta t) = \sum_{i=1}^{n} X_i$$

and the approximation is completed by connecting the samples $X(t_n)$ by
straight lines. Find the PDF of the $X_i$'s to determine how they should be
generated. Hint: The $X_i$'s are increments of $X(t)$.

20.19 (^) (f) For a continuous-time Wiener random process with $\mathrm{var}(X(t))$ =
determine $P[|X(t)| > 1]$. Explain what happens as $t \to \infty$ and why.

20.20 (w) Show that if $X(t)$ is a Rayleigh fading sinusoid, the “demodulation” and
lowpass filtering shown in Figure 20.15 will yield $U$ and $V$, respectively. What
should the bandwidth of the lowpass filter be?

20.21 (c) Generate 10 realizations of a Rayleigh fading sinusoid for $0 < t < 1$. Use
$F_0 = 10$ Hz and $\sigma^2 = 1$ to do so. Overlay your realizations. Hint: Replace
$X(t) = U \cos(2\pi F_0 t) - V \sin(2\pi F_0 t)$ by $X[n] = X(n\Delta t) = U \cos(2\pi F_0 n\Delta t) -
V \sin(2\pi F_0 n\Delta t)$ for $n = 0, 1, \ldots, N$, where $\Delta t = 1/N$ and $N$ is large.

Figure 20.15: Extraction of Rayleigh fading sinusoid lowpass components for Problem 20.20.

20.22 (^) (w) Consider $X_1(t)$ and $X_2(t)$, which are both Rayleigh fading sinusoids
with frequency $F_0 = 1/2$ and which are independent of each other. Each
random process has the total average power $\sigma^2 = 1$. If $Y(t) = X_1(t) + X_2(t)$,
find the joint PDF of $Y(0)$ and $Y(1/4)$.

20.23 (f) A Rayleigh fading sinusoid has the PSD $P_X(F) = \delta(F + 10) + \delta(F - 10)$.
Find the PSDs of $U(t)$ and $V(t)$ and plot them.

20.24 (w) Show that if $X(t)$ is a bandpass random process, the “demodulation” and
lowpass filtering given in Figure 20.16 will yield $U(t)$ and $V(t)$, respectively.

Figure 20.16: Extraction of bandpass random process lowpass components for Problem 20.24.


20.25  (^)  (f)  If  a  bandpass  random  process  has  the  PSD  shown  in  Figure  20.17, 
find  the  PSD  of  U(t)  and  V(t). 
Figure 20.17: PSD for bandpass random process for Problem 20.25.


20.26  (c)  The  random  process  whose  realization  is  shown  in  Figure  20.2  appears 
to  be  similar  in  nature  to  the  bandpass  random  processes  shown  in  Figure 
20.13b.  We  have  already  seen  that  the  marginal  PDF  appears  to  be  Gaussian 
(see  Figure  20.3).  To  see  if  it  is  reasonable  to  model  it  as  a  bandpass  random 
process  we  estimate  the  PSD.  First  run  the  code  given  in  Appendix  20A  to 
produce  the  realization  shown  in  Figure  20.2.  Then,  run  the  code  given  below 
to  estimate  the  PSD  using  an  averaged  periodogram  (see  also  Section  17.7 
for  a  description  of  this).  Does  the  estimated  PSD  indicate  that  the  random 
process  is  a  bandpass  random  process?  If  so,  explain  how  you  can  give  a 
complete  probabilistic  model  for  this  random  process. 


Fs=100; % set sampling rate for later plotting
L=50;I=20; % L = length of block, I = number of blocks
n=[0:I*L-1]'; % set up time indices
Nfft=1024; % set FFT length for Fourier transform
Pav=zeros(Nfft,1);
f=[0:Nfft-1]'/Nfft-0.5; % set discrete-time frequencies
for i=0:I-1
   nstart=1+i*L;nend=L+i*L; % set start and end time indices
                            % of block
   y=x(nstart:nend); % extract block of data
   Pav=Pav+(1/(I*L))*abs(fftshift(fft(y,Nfft))).^2;
                            % compute periodogram
                            % and add to average
                            % of periodograms
end
F=f*Fs; % convert to continuous-time (analog) frequency in Hz
Pest=Pav/Fs; % convert discrete-time PSD to continuous-time PSD
plot(F,Pest)
20.27 (f) For the Gaussian random process with mean zero and PSD


90  <  |F|  <  110 
otherwise 


find  the  probability  that  its  envelope  will  be  less  than  or  equal  to  10  at  t  =  10 
seconds.  Repeat  the  calculation  if  t  =  20  seconds. 

20.28 (w) Prove that $X_1[n] = \frac{1}{2}U[n+1] + \frac{1}{2}U[n-1]$ and $X_2[n] = \frac{1}{2}U[n] + \frac{1}{2}U[n-2]$,
where $U[n]$ is WGN with $\sigma_U^2 = 1$, both have the same PSD given by $P_X(f) =
\frac{1}{2}(1 + \cos(4\pi f))$.


20.29  (w)  It  is  desired  to  generate  a  realization  of  a  WSS  Gaussian  random  process 
by  filtering  WGN  with  an  LSI  filter.  If  the  desired  PSD  is  Px{f)  = 

\  exp(— j2nf)\2,  explain  how  to  do  this. 


20.30  (^)  (w)  It  is  desired  to  generate  a  realization  of  a  WSS  Gaussian  random 
process by filtering WGN with an LSI filter. If the desired PSD is $P_X(f) =
2 - 2\cos(2\pi f)$, explain how to do this.


20.31 (o) (c) Using the results of Problem 20.30, generate a realization of $X[n]$.
To  verify  that  your  data  generation  appears  correct,  estimate  the  ACS  for 
k  =  0, 1, . . . ,  9  and  compare  it  to  the  theoretical  ACS. 


Appendix  20A 


MATLAB Listing for Figure 20.2


clear all
rand('state',0)
t=[0:0.01:0.99]'; % set up transmit pulse time interval
F0=10;
s=cos(2*pi*F0*t); % transmit pulse
ss=[s;zeros(1000-length(s),1)]; % put transmit pulse in receive window
tt=[0:0.01:9.99]'; % set up receive window time interval
x=zeros(1000,1);
for i=1:100 % add up all echos, one for each 0.1 sec interval
   tau=round(10*i+10*(rand(1,1)-0.5)); % time delay for each 0.1 sec interval
                                       % is uniformly distributed - round
                                       % time delay to integer
   x=x+rand(1,1)*shift(ss,tau);
end

shift.m subprogram

% shift.m
%
function y=shift(x,Ns)
%
% This function subprogram shifts the given sequence by Ns points.
% Zeros are shifted in either from the left or right.
%
% Input parameters:
%
%   x  - array of dimension Lx1
%   Ns - integer number of shifts where Ns>0 means a shift to the
%        right and Ns<0 means a shift to the left and if Ns=0, then
%        the sequence is not shifted
%
% Output parameters:
%
%   y  - array of dimension Lx1 containing the
%        shifted sequence
L=length(x);
if abs(Ns)>L
   y=zeros(L,1);
else
   if Ns>0
      y(1:Ns,1)=0;
      y(Ns+1:L,1)=x(1:L-Ns);
   elseif Ns<0
      y(L-abs(Ns)+1:L,1)=0;
      y(1:L-abs(Ns),1)=x(abs(Ns)+1:L);
   else
      y=x;
   end
end


Chapter  21 


Poisson  Random  Processes 


21.1  Introduction 

A random process that is useful for modeling events occurring in time is the Poisson
random process. A typical realization is shown in Figure 21.1 in which the events,
indicated by the “x”s, occur randomly in time.

Figure 21.1: Poisson process events and the Poisson counting random process $N(t)$.

The random process, whose realization is a set of times, is called the Poisson random
process. The random process that counts the number of events in the time interval
$[0, t]$, and which is denoted by $N(t)$, is called the Poisson counting random process.
It is clear from Figure 21.1 that the two random processes are equivalent descriptions
of the same random phenomenon. Note that $N(t)$ is a continuous-time/discrete-valued
(CTDV) random process. Also, because $N(t)$ counts the number of events from the
initial time $t = 0$ up to and including the time $t$, the value of $N(t)$ at a jump is
$N(t^+)$. Thus, $N(t)$ is right-continuous (the same property as for the CDF of a discrete
random variable). The motivation for the widespread use of the Poisson random process
is its ability to model a wide range of physical and man-made random phenomena. Some of
these  are  the  distribution  in  time  of  radioactive  counts,  the  arrivals  of  customers 
at  a  cashier,  requests  for  service  in  computer  networks,  and  calls  made  to  a  central 
location,  to  name  just  a  few.  In  Chapter  5  we  gave  an  example  of  the  application 
of  the  Poisson  PMF  to  the  servicing  of  customers  at  a  supermarket  checkout.  Here 
we  examine  the  characteristics  of  a  Poisson  random  process  in  more  detail,  paying 
particular  attention  not  only  to  the  probability  of  a  given  number  of  events  in  a  time 
interval  but  also  to  the  probability  for  the  arrival  times  of  those  events.  In  order  to 
avoid  confusing  the  probabilistic  notion  of  an  event  with  the  common  usage,  we  will 
refer  to  the  events  shown  in  Figure  21.1  as  arrivals. 

The  Poisson  random  process  is  a  natural  extension  of  a  sequence  of  independent 
and  identically  distributed  Bernoulli  trials  (see  Example  16.1).  The  Poisson  counting 
random  process  N(t)  then  becomes  the  extension  of  the  binomial  counting  random 
process  discussed  in  Example  16.5.  To  make  this  identification,  consider  a  Bernoulli 
random  process,  which  is  defined  as  a  sequence  of  IID  Bernoulli  trials,  with  U[n]  =  1 
with  probability  p  and  U[n]  =  0  with  probability  1  —  p.  Now  envision  a  Bernoulli 
trial for each small time slot of width $\Delta t$ in the interval $[0, t]$ as shown in Figure
21.2. Thus, we will observe either a 1 with probability $p$ or a 0 with probability
$1 - p$ for each of the $M = t/\Delta t$ time slots.

Figure 21.2: IID Bernoulli random process with one trial per time slot.

Recall that on the average we will observe $Mp$ ones. Now if $\Delta t \to 0$ and $M \to \infty$
with $t = M\Delta t$ held constant, we will obtain the Poisson random process as the
limiting form of the Bernoulli random process. Also, recall that the number of ones
in $M$ IID Bernoulli trials is a binomial random variable. Hence, it seems reasonable
that the number of arrivals in a Poisson random process should be a Poisson random
variable in accordance with our results in Section 5.6. We next argue that this is
indeed the case. For the binomial counting random process, thought of as one trial
per time slot, we have that the number of ones in the interval $[0, t]$ has the PMF

$$P[N(t) = k] = \binom{M}{k} p^k (1 - p)^{M-k} \qquad k = 0, 1, \ldots, M.$$

But as $M \to \infty$ and $p \to 0$ with $E[N(t)] = Mp$ being fixed, the binomial PMF
becomes the Poisson PMF or $N(t) \sim \mathrm{Pois}(\lambda')$, where $\lambda' = E[N(t)] = Mp$. (Note
that as the number of time slots $M$ increases, we need to let $p \to 0$ in order to
maintain an average number of arrivals in $[0, t]$.) Thus, replacing $\lambda'$ by $E[N(t)]$, we
write the Poisson PMF as

$$P[N(t) = k] = \exp(-E[N(t)]) \frac{(E[N(t)])^k}{k!} \qquad k = 0, 1, \ldots. \qquad (21.1)$$

To determine $E[N(t)]$ for use in (21.1), where $t$ may be arbitrary, we examine $Mp$
in the limit. Thus,

$$\begin{aligned}
E[N(t)] &= \lim_{\substack{M \to \infty \\ p \to 0}} Mp \\
        &= \lim_{\substack{\Delta t \to 0 \\ p \to 0}} \frac{t}{\Delta t}\, p = t \lim_{\substack{\Delta t \to 0 \\ p \to 0}} \frac{p}{\Delta t} \\
        &= \lambda t
\end{aligned}$$

where we define $\lambda$ as the limit of $p/\Delta t$. Since $\lambda = E[N(t)]/t$, we can interpret $\lambda$ as
the average number of arrivals per second or the rate of the Poisson random process.
This is a parameter that is easily specified in practice. Using this definition we have
that

$$P[N(t) = k] = \exp(-\lambda t) \frac{(\lambda t)^k}{k!} \qquad k = 0, 1, \ldots. \qquad (21.2)$$

As mentioned previously, $N(t)$ is the Poisson counting random process and the
probability of $k$ arrivals from $t = 0$ up to and including $t$ is given by (21.2). It is a
semi-infinite random process with $N(0) = 0$ by definition.
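
The limiting argument can be illustrated numerically. The sketch below compares the binomial PMF for $M$ slots, with $p = \lambda t/M$ so that $Mp$ stays fixed, against the Poisson PMF of (21.2). The rate, interval length, range of $k$, and slot counts are assumed values chosen only for illustration, and the PMFs are evaluated in the log domain with gammaln to avoid overflow of the factorials.

```matlab
% Binomial PMF with M slots versus Poisson PMF as M grows (illustrative values)
lambda = 2; t = 3;                 % assumed rate (arrivals/sec) and interval length (sec)
k = (0:15)';                       % numbers of arrivals at which the PMFs are compared
pois = exp(-lambda*t + k*log(lambda*t) - gammaln(k+1));   % Poisson PMF (21.2)
Mvals = [20 100 1000 10000];       % numbers of time slots (each must exceed max(k))
maxerr = zeros(size(Mvals));
for m = 1:length(Mvals)
   M = Mvals(m);
   p = lambda*t/M;                 % per-slot probability so that M*p = lambda*t
   binom = exp(gammaln(M+1) - gammaln(k+1) - gammaln(M-k+1) ...
               + k*log(p) + (M-k)*log(1-p));              % binomial PMF
   maxerr(m) = max(abs(binom - pois));                    % largest PMF discrepancy
end
disp([Mvals' maxerr'])             % discrepancy shrinks as M increases
```

The second column of the displayed table should decrease as the number of slots increases, in line with the binomial PMF approaching the Poisson PMF.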

It is possible to derive all the properties of a Poisson counting random process
by employing the previous device of viewing it as the limiting form of a binomial
counting random process as $\Delta t \to 0$. However, it is cumbersome to do so and
therefore, we present an alternative derivation that is consistent with the same basic
assumptions. One advantage of viewing the Poisson random process as a limiting
form is that many of its properties become more obvious by consideration of a
sequence of IID Bernoulli trials. These properties are inherited from the binomial
counting random process; for example, the increments $N(t_2) - N(t_1)$ must be
independent. (Can you explain why this must be true for the binomial counting
random process?)

21.2  Summary 

The  Poisson  counting  random  process  is  introduced  in  Section  21.1.  The  probability 
of  k  arrivals  in  the  time  interval  [0,  t]  is  given  by  (21.2).  This  probability  is  also 
derived  in  Section  21.3  based  on  a  set  of  axioms  that  the  Poisson  random  process 
should  adhere  to.  Some  examples  of  typical  problems  for  which  this  probability  is 
useful are also described in that section. The times between arrivals or interarrival
times are shown in Section 21.4 to be independent and exponentially distributed as
given by (21.6). The arrival times of a Poisson random process are described by an
Erlang  PDF  given  in  (21.8).  An  extension  of  the  Poisson  random  process  that  is 
useful  is  the  compound  Poisson  random  process  described  in  Section  21.6.  Moments 
of  the  random  process  can  be  found  from  the  characteristic  function  of  (21.12). 
In  particular,  the  mean  is  given  by  (21.13).  A  Poisson  random  process  is  easily 
simulated  on  a  computer  using  the  MATLAB  code  listed  in  Section  21.7.  Finally, 
an  application  of  the  compound  Poisson  random  process  to  automobile  traffic  signal 
planning  is  the  subject  of  Section  21.8. 


21.3  Derivation  of  Poisson  Counting  Random  Process 


We next derive the Poisson counting random process by appealing to a set of axioms
that are consistent with our previous assumptions. Clearly, since the random process
starts at $t = 0$, we assume that $N(0) = 0$. Next, since the binomial counting
random process has increments that are independent and stationary (Bernoulli trials
are IID), we assume the same for the Poisson counting random process. Thus,
for two increments we assume that the random variables $I_1 = N(t_2) - N(t_1)$ and
$I_2 = N(t_4) - N(t_3)$ are independent if $t_4 > t_3 > t_2 > t_1$ and also have the same
PMF if additionally $t_4 - t_3 = t_2 - t_1$. Likewise, we assume this is true for all possible
sets of increments. Note that $t_4 > t_3 > t_2 > t_1$ corresponds to nonoverlapping time
intervals. The increments will still be independent if $t_3 = t_2$ or the time intervals
have a single point in common since the probability of $N(t)$ changing at a point
is zero as we will see shortly. As for the Bernoulli random process, there can be
at most one arrival in each time slot. Similarly, for the Poisson counting random
process we allow at most one arrival for each time slot so that

$$P[N(t + \Delta t) - N(t) = k] = \begin{cases} 1 - p & k = 0 \\ p & k = 1 \end{cases}$$

and recall that

$$\lim_{\substack{\Delta t \to 0 \\ p \to 0}} \frac{p}{\Delta t} = \lambda$$

so that for $\Delta t$ small, $p = \lambda \Delta t$ and

$$P[N(t + \Delta t) - N(t) = k] = \begin{cases} 1 - \lambda \Delta t & k = 0 \\ \lambda \Delta t & k = 1 \\ 0 & k \ge 2. \end{cases}$$

Therefore, our axioms become

Axiom 1 $N(0) = 0$.

Axiom 2 $N(t)$ has independent and stationary increments.

Axiom 3 $P[N(t + \Delta t) - N(t) = k] = \begin{cases} 1 - \lambda \Delta t & k = 0 \\ \lambda \Delta t & k = 1 \end{cases}$
for all $t$.

With these axioms we wish to prove that (21.2) follows. The derivation is indicative
of an approach commonly used for analyzing continuous-time Markov random
processes [Cox and Miller 1965] and so is of interest in its own right.

21.3.1 Derivation

To begin, consider the determination of $P[N(t) = 0]$ for an arbitrary $t > 0$. Then
referring to Figure 21.3a we see that for no arrivals in $[0, t]$, there must be no arrivals
in $[0, t - \Delta t]$ and also no arrivals in $(t - \Delta t, t]$.

Figure 21.3: Possible number of arrivals in indicated time intervals. (a) $N(t) = 0$. (b) $N(t) = 1$.

Therefore,

$$\begin{aligned}
P[N(t) = 0] &= P[N(t - \Delta t) = 0,\; N(t) - N(t - \Delta t) = 0] \\
&= P[N(t - \Delta t) = 0]\, P[N(t) - N(t - \Delta t) = 0] && \text{(Axiom 2 - independence)} \\
&= P[N(t - \Delta t) = 0]\, P[N(t + \Delta t) - N(t) = 0] && \text{(Axiom 2 - stationarity)} \\
&= P[N(t - \Delta t) = 0]\,(1 - \lambda \Delta t) && \text{(Axiom 3).}
\end{aligned}$$

If we let $P_0(t) = P[N(t) = 0]$, then

$$P_0(t) = P_0(t - \Delta t)(1 - \lambda \Delta t)$$

or

$$\frac{P_0(t) - P_0(t - \Delta t)}{\Delta t} = -\lambda P_0(t - \Delta t).$$

Now letting $\Delta t \to 0$, we arrive at the linear differential equation

$$\frac{dP_0(t)}{dt} = -\lambda P_0(t)$$

for which the solution is $P_0(t) = c \exp(-\lambda t)$, where $c$ is an arbitrary constant. To
evaluate the constant we invoke the initial condition that $P_0(0) = P[N(0) = 0] = 1$
by Axiom 1 to yield $c = 1$. Thus, we have finally that

$$P[N(t) = 0] = P_0(t) = \exp(-\lambda t).$$
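
The passage from the difference equation to the exponential can be checked numerically. Repeating the relation between the probabilities of no arrivals at $t$ and at $t - \Delta t$ over all $t/\Delta t$ slots gives $(1 - \lambda \Delta t)^{t/\Delta t}$, which should approach $\exp(-\lambda t)$ as the slot width shrinks. The sketch below uses assumed values for the rate, the time, and the slot widths.

```matlab
% Discrete-slot probability of no arrivals versus the exponential limit
lambda = 0.1; t = 60;              % assumed rate (arrivals/sec) and time (sec)
dt = [1 0.1 0.01 0.001];           % slot widths to try (assumed)
n  = round(t./dt);                 % number of slots in [0,t] for each width
P0 = (1 - lambda*dt).^n;           % no arrival in any of the n slots
err = abs(P0 - exp(-lambda*t));    % distance from the limiting value
disp([dt' P0' err'])               % err decreases as dt decreases
```

The error column shrinks roughly in proportion to the slot width, which is the behavior one expects from a first-order approximation to the differential equation.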
Next we use the same argument to find a differential equation for $P_1(t) = P[N(t) =
1]$ by referring to Figure 21.3b. We can either have no arrivals in $[0, t - \Delta t]$ and one
arrival in $(t - \Delta t, t]$ or one arrival in $[0, t - \Delta t]$ and no arrivals in $(t - \Delta t, t]$. These
are the only possibilities since there can be at most one arrival in a time interval of
length $\Delta t$. The two events are mutually exclusive so that

$$\begin{aligned}
P[N(t) = 1] &= P[N(t - \Delta t) = 0,\; N(t) - N(t - \Delta t) = 1] \\
&\quad + P[N(t - \Delta t) = 1,\; N(t) - N(t - \Delta t) = 0] \\
&= P[N(t - \Delta t) = 0]\, P[N(t) - N(t - \Delta t) = 1] \\
&\quad + P[N(t - \Delta t) = 1]\, P[N(t) - N(t - \Delta t) = 0] && \text{(independence)} \\
&= P[N(t - \Delta t) = 0]\, P[N(t + \Delta t) - N(t) = 1] \\
&\quad + P[N(t - \Delta t) = 1]\, P[N(t + \Delta t) - N(t) = 0]. && \text{(stationarity)}
\end{aligned}$$

Using the definition of $P_1(t)$ and Axiom 3,

$$P_1(t) = P_0(t - \Delta t)\lambda \Delta t + P_1(t - \Delta t)(1 - \lambda \Delta t)$$

or

$$\frac{P_1(t) - P_1(t - \Delta t)}{\Delta t} = -\lambda P_1(t - \Delta t) + \lambda P_0(t - \Delta t)$$

and as $\Delta t \to 0$, we have the differential equation

$$\frac{dP_1(t)}{dt} + \lambda P_1(t) = \lambda P_0(t).$$

In like fashion we can show (see Problem 21.1) that if $P_k(t) = P[N(t) = k]$, then

$$\frac{dP_k(t)}{dt} + \lambda P_k(t) = \lambda P_{k-1}(t) \qquad k = 1, 2, \ldots \qquad (21.3)$$

where we know that $P_0(t) = \exp(-\lambda t)$. This is a set of simultaneous linear differential
equations that fortunately can be solved recursively. Since $P_0(t)$ is known, we
can solve for $P_1(t)$. Once $P_1(t)$ has been found, then $P_2(t)$ can be solved for, etc.
It is shown in Problem 21.2 that by using Laplace transforms, we can easily solve
these equations. The result is

$$P_k(t) = \exp(-\lambda t) \frac{(\lambda t)^k}{k!} \qquad k = 0, 1, \ldots$$

so that finally we have the desired result

$$P[N(t) = k] = \exp(-\lambda t) \frac{(\lambda t)^k}{k!} \qquad k = 0, 1, \ldots \qquad (21.4)$$

which is the usual Poisson PMF. The only difference from that described in Section
5.5.4 is that here $\lambda$ represents an arrival rate. Since if $X \sim \mathrm{Pois}(\lambda')$, then $E[X] = \lambda'$,
we have $\lambda' = \lambda t$. Hence, $\lambda = \lambda'/t = E[N(t)]/t$, which is seen to be the average
number of arrivals per second.
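
As an alternative to the Laplace transform route of Problem 21.2, the recursion in (21.3) can also be solved with an integrating factor. The following is only a sketch of that alternative; it assumes the initial conditions $P_k(0) = 0$ for $k \ge 1$, which follow from Axiom 1 since $P_0(0) = 1$.

```latex
% Multiply (21.3) by exp(lambda t) and recognize a total derivative:
\begin{align*}
\frac{d}{dt}\left[ e^{\lambda t} P_k(t) \right]
  &= e^{\lambda t} \left( \frac{dP_k(t)}{dt} + \lambda P_k(t) \right)
   = \lambda e^{\lambda t} P_{k-1}(t).
\intertext{For $k = 1$ the right-hand side is $\lambda e^{\lambda t} e^{-\lambda t} = \lambda$, so integrating from $0$ to $t$ with $P_1(0) = 0$ gives}
e^{\lambda t} P_1(t) &= \lambda t
  \quad \Longrightarrow \quad P_1(t) = \lambda t\, e^{-\lambda t}.
\intertext{Induction step: if $P_{k-1}(t) = e^{-\lambda t} (\lambda t)^{k-1}/(k-1)!$, then}
\frac{d}{dt}\left[ e^{\lambda t} P_k(t) \right]
  &= \lambda \frac{(\lambda t)^{k-1}}{(k-1)!}
  \quad \Longrightarrow \quad
  e^{\lambda t} P_k(t) = \int_0^t \lambda \frac{(\lambda u)^{k-1}}{(k-1)!}\, du
  = \frac{(\lambda t)^k}{k!},
\end{align*}
% which reproduces the Poisson PMF of (21.4).
```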
21.3.2 Some Examples

Before proceeding with some examples it should be pointed out that the Poisson
counting random process is not stationary or even WSS. This is evident from the
PMF of $N(t)$ since $E[N(t_2)] = \lambda t_2 \ne \lambda t_1 = E[N(t_1)]$ for $t_2 \ne t_1$. As its properties
are  inherited  from  the  binomial  counting  random  process,  it  exhibits  the  properties 
of  a  sum  random  process  (see  Section  16.4).  Also,  in  determining  probabilities  of 
events,  the  fact  that  the  increments  are  independent  and  stationary  will  greatly 
simplify  our  calculations. 

Example  21.1  -  Customer  arrivals 

Customers arrive at a checkout lane at the rate of 0.1 customers per second according
to a Poisson random process. Determine the probability that 5 customers
will arrive during the first minute the lane is open and also 5 customers will arrive
the second minute it is open. During the time interval $[0, 60]$ the probability of 5
arrivals is from (21.4)

$$P[N(60) = 5] = \exp[-0.1(60)] \frac{[0.1(60)]^5}{5!} = 0.1606.$$

This will also be the probability of 5 customers arriving during the second minute
interval or for any one minute interval $[t, t + 60]$ since

$$\begin{aligned}
P[N(t + 60) - N(t) = 5] &= P[N(60) - N(0) = 5] && \text{(increment stationarity)} \\
&= P[N(60) = 5] && (N(0) = 0)
\end{aligned}$$

which is not dependent on $t$. Hence, the probability of 5 customers arriving in the
first minute and 5 more arriving in the second minute is

$$\begin{aligned}
&P[N(60) - N(0) = 5,\; N(120) - N(60) = 5] \\
&\quad = P[N(60) - N(0) = 5]\, P[N(120) - N(60) = 5] && \text{(increment independence)} \\
&\quad = P[N(60) - N(0) = 5]\, P[N(60) - N(0) = 5] && \text{(increment stationarity)} \\
&\quad = P^2[N(60) = 5] = 0.0258 && (N(0) = 0)
\end{aligned}$$

❖ 
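
The two numbers in Example 21.1 can be reproduced with a few lines of MATLAB. The helper name poispmf is introduced here only for illustration (it is an anonymous function defined in the sketch, not a built-in); the rate and interval come from the example.

```matlab
% Numerical check of Example 21.1 (poispmf is a local helper, not a built-in)
lambda = 0.1;                                   % arrival rate in customers per second
poispmf = @(k,a) exp(-a).*a.^k./factorial(k);   % Poisson PMF with mean a, as in (21.4)
P5  = poispmf(5, lambda*60)                     % five arrivals in one minute, about 0.1606
P55 = P5^2                                      % five in each of two minutes, about 0.0258
```

Squaring is justified only because the two one-minute increments are independent and have the same PMF.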


Example  21.2  —  Traffic  bursts 

Consider the arrival of cars at an intersection. It is known that for any 5 minute
interval 50 cars arrive on the average. For any 5 minute interval what is the probability
of 20 cars in the first minute and 30 cars in the next 4 minutes? Since the
probabilities of the increments do not change with the time origin due to stationarity,
we can assume that the 5 minute interval in question starts at $t = 0$ and ends
at $t = 300$ seconds. Thus, we wish to determine the probability of a traffic burst
$P_B$, which is

$$P_B = P[N(60) = 20,\; N(300) - N(60) = 30].$$

Since the increments are independent, we have

$$P_B = P[N(60) = 20]\, P[N(300) - N(60) = 30]$$

and because they are also stationary

$$\begin{aligned}
P_B &= P[N(60) = 20]\, P[N(240) - N(0) = 30] \\
    &= P[N(60) = 20]\, P[N(240) = 30] \\
    &= \exp(-60\lambda) \frac{(60\lambda)^{20}}{20!} \exp(-240\lambda) \frac{(240\lambda)^{30}}{30!}.
\end{aligned}$$

Finally, since the arrival rate is given by $\lambda = 50/300 = 1/6$, the probability of a
traffic burst is

$$P_B = \exp(-10) \frac{10^{20}}{20!} \exp(-40) \frac{40^{30}}{30!} = 3.4458 \times 10^{-5}.$$

❖ 
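The burst probability involves large powers and factorials, so a numerical evaluation is more comfortable in the log domain. The Python sketch below is one way to do this; the helper name log_poisson_pmf and the comparison case of 10 and 40 cars (the means of the two increments) are illustrative assumptions, not part of the example.

```python
import math

def log_poisson_pmf(k, mean):
    """Natural log of the Poisson PMF exp(-mean) * mean**k / k!."""
    return -mean + k * math.log(mean) - math.lgamma(k + 1)

lam = 50 / 300  # cars per second (50 cars per 5 minutes)

# 20 cars in the first 60 s and 30 cars in the following 240 s
log_pb = log_poisson_pmf(20, lam * 60) + log_poisson_pmf(30, lam * 240)
p_burst = math.exp(log_pb)

# For comparison: 10 cars then 40 cars, the means of the two increments
p_typical = math.exp(log_poisson_pmf(10, lam * 60)
                     + log_poisson_pmf(40, lam * 240))

print(f"{p_burst:.4e}", f"{p_typical:.4e}")
```

The first value should reproduce the traffic burst probability above, and the second shows how much more likely an evenly spread arrival pattern is.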

In  many  applications  it  is  important  to  assess  not  only  the  probability  of  a  number 
of  arrivals  within  a  given  time  interval  but  also  the  distribution  of  these  arrival 
times.  Are  they  evenly  spaced  or  can  they  bunch  up  as  in  the  last  example?  In  the 
next  section  we  answer  these  questions. 


21.4  Interarrival  Times 

Consider a typical realization of a Poisson random process as shown in Figure
21.4. The times t_1, t_2, t_3, ... are called the arrival times while the time intervals
z_1, z_2, z_3, ... are called the interarrival times.

Figure 21.4: Definition of arrival times t_i's and interarrival times z_i's. (Plot of N(t) versus t with the arrival times t_1, ..., t_5 marked.)

The interarrival times shown in Figure 21.4 are realizations of the random variables
Z_1, Z_2, Z_3, .... We wish to be able to compute probabilities for a finite set, say
Z_1, Z_2, ..., Z_k. Since N(t) is a continuous-time random process, the time between
arrivals is also continuous and so a joint PDF is sought. To begin we first determine
p_{Z_1}(z_1). Note that Z_1 = T_1, where T_1 is the random variable denoting the first
arrival. By the definition of the first arrival if Z_1 > ξ_1, then N(ξ_1) = 0 as shown
in Figure 21.4. Conversely, if N(ξ_1) = 0, then the first arrival has not occurred as
of time ξ_1 and so Z_1 > ξ_1. This argument shows that the events {Z_1 > ξ_1} and
{N(ξ_1) = 0} are equivalent and therefore

\[
\begin{aligned}
P[Z_1 > \xi_1] &= P[N(\xi_1) = 0] \\
&= \exp(-\lambda \xi_1) \qquad \xi_1 \ge 0 \qquad (21.5)
\end{aligned}
\]

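where we have used (21.4). As a result, the PDF is for z_1 ≥ 0

\[
\begin{aligned}
p_{Z_1}(z_1) &= \frac{d}{dz_1} F_{Z_1}(z_1) \\
&= \frac{d}{dz_1} \left( 1 - P[Z_1 > z_1] \right) \\
&= \frac{d}{dz_1} \left[ 1 - \exp(-\lambda z_1) \right] \\
&= \lambda \exp(-\lambda z_1)
\end{aligned}
\]

and finally the PDF of the first arrival is

\[
p_{Z_1}(z_1) =
\begin{cases}
\lambda \exp(-\lambda z_1) & z_1 \ge 0 \\
0 & z_1 < 0
\end{cases}
\qquad (21.6)
\]

or Z_1 ~ exp(λ). An example follows.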

Example  21.3  —  Waiting  for  an  arrival 

Assume that at t = 0 we start to wait for an arrival. Then we know from (21.6)
that the time we will have to wait is a random variable with Z_1 ~ exp(λ). On the
average we will have to wait E[Z_1] = 1/λ seconds. This is reasonable in that λ is
average arrivals per second and therefore 1/λ is seconds per arrival. However, say
we have already waited ξ_1 seconds. What is the probability that we will have to wait
more than an additional ξ_2 seconds? In probabilistic terms we wish to compute the
conditional probability P[Z_1 > ξ_1 + ξ_2 | Z_1 > ξ_1]. This is found as follows.

\[
\begin{aligned}
P[Z_1 > \xi_1 + \xi_2 \mid Z_1 > \xi_1]
&= \frac{P[Z_1 > \xi_1 + \xi_2,\, Z_1 > \xi_1]}{P[Z_1 > \xi_1]} \\
&= \frac{P[Z_1 > \xi_1 + \xi_2]}{P[Z_1 > \xi_1]}
\end{aligned}
\]

since the arrival time will be greater than both ξ_1 + ξ_2 and ξ_1 only if it is greater
than the former. Now using (21.5) we have that

\[
\begin{aligned}
P[Z_1 > \xi_1 + \xi_2 \mid Z_1 > \xi_1]
&= \frac{\exp[-\lambda(\xi_1 + \xi_2)]}{\exp(-\lambda \xi_1)} \\
&= \exp(-\lambda \xi_2) \\
&= P[Z_1 > \xi_2]. \qquad (21.7)
\end{aligned}
\]

Hence, the conditional probability that we will have to wait more than an additional
ξ_2 seconds given that we have already waited ξ_1 seconds is just the probability that
we will have to wait more than ξ_2 seconds. The fact that we have already waited
does not in any way affect the probability of the first arrival. Once we have waited
and observed that no arrival has occurred up to time ξ_1, then the random process in
essence starts over as if it were at time t = 0. This property of the Poisson random
process is referred to as the memoryless property. It is somewhat disconcerting to
know that the chance your bus will arrive in the next 5 minutes, given that it is
already 5 minutes late, is no better than the chance it had of arriving within 5
minutes of its scheduled time. However, this conclusion is consistent with the Poisson
random process model. It is also evident by examining the similar result of waiting
for a fair coin to come up heads given that it has already exhibited 10 tails in a row.
In Problem 21.12 an alternative derivation of the memoryless property is given which
makes use of the geometric random variable.

❖
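The memoryless property (21.7) can also be seen empirically. The Python sketch below draws exponential waiting times and compares the conditional relative frequency with the unconditional one; the rate, the two waiting times, the number of trials and the seed are arbitrary values assumed for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

lam = 0.5            # assumed arrival rate (arrivals per second)
xi1, xi2 = 3.0, 2.0  # already waited xi1; ask about xi2 more seconds
trials = 200_000

waits = [random.expovariate(lam) for _ in range(trials)]

# Relative frequency of Z1 > xi1 + xi2 among outcomes with Z1 > xi1
survived = [z for z in waits if z > xi1]
cond_est = sum(z > xi1 + xi2 for z in survived) / len(survived)

# Relative frequency of Z1 > xi2 with no conditioning
uncond_est = sum(z > xi2 for z in waits) / trials

exact = math.exp(-lam * xi2)  # right-hand side of (21.7)
print(cond_est, uncond_est, exact)
```

Both estimates should be close to exp(-λ ξ_2), up to the usual Monte Carlo fluctuation.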

We next give the joint PDF for two or more interarrival times. It is shown in
Appendix 21A that the interarrival times Z_1, Z_2, ..., Z_k are IID random variables
with each one having Z_i ~ exp(λ). This result may also be reconciled in light of
the Poisson random process being the limiting form of a Bernoulli random process.
Consider a Bernoulli random process {X[0] = 0, X[1], X[2], ...}, where X[0] = 0 by
definition, and assume interarrival times of k_1 and k_2, where k_1 ≥ 1, k_2 ≥ 1. For
example, if X[1] = 0, X[2] = 1, X[3] = 0, X[4] = 0, and X[5] = 1, then we would
have k_1 = 2 and k_2 = 3. In general,

\[
\begin{aligned}
&P[\text{first interarrival time} = k_1,\ \text{second interarrival time} = k_2] \\
&\quad = P[X[n] = 0 \text{ for } 1 \le n \le k_1 - 1,\ X[k_1] = 1,\ X[n] = 0 \text{ for } k_1 + 1 \le n \le k_1 + k_2 - 1,\ X[k_1 + k_2] = 1] \\
&\quad = [(1-p)^{k_1 - 1} p]\,[(1-p)^{k_2 - 1} p].
\end{aligned}
\]

Hence, the joint PMF factors so that the interarrival times are independent and
furthermore they are identically distributed (let k_1 = k_2). An example follows.

Example  21.4  -  Expected  time  for  calls 

A customer call service center opens at 9 A.M. The calls received follow a Poisson
random process at the average rate of 600 calls per hour. The 20th call comes in
at 9:01 A.M. At what time can we expect the next call to come in? Let Z_21 be
the elapsed time from 9:01 A.M. until the next call comes in. Since the interarrival
times are independent, they do not depend upon the past history of arrivals. Hence,
Z_21 = T_21 - T_20 ~ exp(λ). Since the mean of an exponential random variable Z
is just 1/λ and from the information given λ = 600/3600 = 1/6 calls per second,
we have that E[Z_21] = 1/(1/6) = 6 seconds. Hence, we can expect the next call to
come in at 9:01:06 A.M.

❖


21.5  Arrival  Times 


The kth arrival time T_k is defined as the time from t = 0 until the kth arrival occurs.
The arrival times are illustrated in Figure 21.4, where T_k is also referred to as the
waiting time until the kth arrival. In this section we will determine the PDF of T_k.
It is seen from Figure 21.4 that t_k = z_1 + z_2 + ... + z_k so that the random variable
of interest is

\[ T_k = \sum_{i=1}^{k} Z_i. \]

But we saw in the last section that the Z_i's are IID with Z_i ~ exp(λ). Hence, the
PDF of T_k is obtained by determining the PDF for a sum of IID random variables.
This is a problem that has been studied in Section 14.6, and is solved most readily by
the use of the characteristic function. Recall that if X_1, X_2, ..., X_k are IID random
variables, then the characteristic function for Y = X_1 + X_2 + ... + X_k is
φ_Y(ω) = φ_X^k(ω). Thus, the PDF for Y, assuming that Y is a continuous random
variable, is found from the continuous-time inverse Fourier transform (defined to
correspond to the Fourier transform used in the characteristic function definition,
which uses a -j and radian frequency ω) as

\[ p_Y(y) = \int_{-\infty}^{\infty} \phi_X^k(\omega) \exp(-j\omega y) \frac{d\omega}{2\pi}. \]

From Table 11.1 we have that φ_{Z_i}(ω) = λ/(λ - jω) and therefore

\[ \phi_{T_k}(\omega) = \left( \frac{\lambda}{\lambda - j\omega} \right)^k = \left( \frac{1}{1 - j\omega/\lambda} \right)^k. \]

Again referring to Table 11.1, we see that this is the characteristic function of a
Gamma random variable with α = k so that T_k ~ Γ(k, λ). Specifically, this is the
Erlang random variable described in Section 10.5.6. Hence, we have that

\[ p_{T_k}(t) = \frac{\lambda^k}{(k-1)!}\, t^{k-1} \exp(-\lambda t) \qquad t \ge 0. \qquad (21.8) \]

(See also Problem 21.15 for the derivation for k = 2 using a convolution integral and
Problem 21.16 for an alternative derivation for the general case.) Note that for a
Γ(α, λ) random variable the mean is α/λ so that with α = k, we have the expected
time for the kth arrival as

\[ E[T_k] = \frac{k}{\lambda} \qquad (21.9) \]

or equivalently

\[ E[T_k] = k E[T_1]. \qquad (21.10) \]

On the average the time to the kth arrival is just k times the time to the first arrival,
a somewhat pleasing result. An example follows.

Example  21.5  —  Computer  servers 

A computer server is designed to provide downloaded software when requested. It
can honor a total of 80 requests in each hour before it becomes overloaded. If the
requests are made in accordance with a Poisson random process at an average rate
of 60 requests per hour, what is the probability that it will be overloaded in the first
hour? We need to determine the probability that the 81st request will occur at a
time t ≤ 3600 seconds. Thus, from (21.8) with k = 81

\[
P[\text{overloaded in first hour}] = P[T_{81} \le 3600]
= \int_0^{3600} \frac{\lambda^{81}}{80!}\, t^{80} \exp(-\lambda t)\, dt.
\]

Here the arrival rate of the requests is λ = 60/3600 = 1/60 per second and therefore

\[
P[\text{overloaded in first hour}] = \frac{1}{60} \int_0^{3600} \frac{(t/60)^{80}}{80!} \exp(-t/60)\, dt.
\]

Using the result

\[
\int \frac{(at)^n}{n!} \exp(-at)\, dt = -\frac{\exp(-at)}{a} \sum_{i=0}^{n} \frac{(at)^i}{i!}
\]

it follows that

\[
\begin{aligned}
P[\text{overloaded in first hour}]
&= \frac{1}{60} \left[ -\frac{\exp(-t/60)}{1/60} \sum_{i=0}^{80} \frac{(t/60)^i}{i!} \right]_0^{3600} \\
&= 1 - \exp(-60) \sum_{i=0}^{80} \frac{60^i}{i!} \\
&= 0.0056.
\end{aligned}
\]


❖ 
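The closed-form sum at the end of this example is easy to evaluate numerically. The Python sketch below does so; the function name erlang_cdf is an assumption introduced here, and each term is computed through its logarithm so that no large power or factorial is formed explicitly.

```python
import math

def erlang_cdf(t, k, lam):
    """P[T_k <= t] = 1 - sum_{i=0}^{k-1} exp(-lam*t) (lam*t)^i / i!."""
    mean = lam * t
    # Work with the log of each term to avoid overflow for large k
    tail = sum(math.exp(-mean + i * math.log(mean) - math.lgamma(i + 1))
               for i in range(k))
    return 1.0 - tail

lam = 60 / 3600  # requests per second
p_overload = erlang_cdf(3600, 81, lam)
print(round(p_overload, 4))
```

The same function gives the CDF of any arrival time T_k, which is the probability of at least k arrivals in [0, t].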
21.6 Compound Poisson Random Process

A Poisson counting random process increments its value by one for each new arrival.
In some applications we may not know the increment in advance. An example would
be to determine the average amount of all transactions within a bank for a given
day. In this case the amount obtained is the sum of all deposits and withdrawals.
To model these transactions we could assume that customers arrive at the bank
according to a Poisson random process. If, for example, each customer deposited
one dollar, then at the end of the day, say at time t_0, the total amount of the
transactions X(t_0) could be written as

\[ X(t_0) = \sum_{i=1}^{N(t_0)} 1 = N(t_0). \]

This is the standard Poisson counting random process. If, however, there are withdrawals,
then this would no longer hold. Furthermore, if the deposits and withdrawals
are unknown to us before they are made, then we would need to model
each one by a random variable, say U_i. The random variable would take on positive
values for deposits and negative values for withdrawals and probabilities could be
assigned to the possible values of U_i. The total dollar amount of the transactions at
the end of the day would be

\[ X(t_0) = \sum_{i=1}^{N(t_0)} U_i. \]

With this motivation we will consider the more general case in which the U_i's are
either discrete or continuous random variables, and denote the total at time t by
the random process X(t). This random process is therefore given by

\[ X(t) = \sum_{i=1}^{N(t)} U_i \qquad t \ge 0. \qquad (21.11) \]

It is a continuous-time random process but can be either continuous-valued or
discrete-valued depending upon whether the U_i's are continuous or discrete random
variables. We furthermore assume that the U_i's are IID random variables. Hence,
X(t) is similar to the usual sum of IID random variables except that the number
of terms in the sum is random and the number of terms is distributed according
to a Poisson random process. This random process is called a compound Poisson
random process.

In summary, we let X(t) = U_1 + U_2 + ... + U_{N(t)} for t > 0, where the U_i's are IID
random variables and N(t) is a Poisson counting random process with arrival rate λ.
Also, we define X(0) = 0, and furthermore assume that the U_i's and N(t) are
independent of each other for all t.
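We next determine the marginal PMF or PDF of X(t). To do so we will use
characteristic functions in conjunction with conditioning arguments. The key to
success here is to turn the sum with a random number of terms into one with a fixed
number by conditioning. Then, the usual characteristic function approach described
in Section 14.6 will be applicable. Hence, consider for a fixed t = t_0 the random
variable X(t_0) and write its characteristic function as

\[
\begin{aligned}
\phi_{X(t_0)}(\omega)
&= E[\exp(j\omega X(t_0))] && \text{(definition)} \\
&= E\left[ \exp\left( j\omega \sum_{i=1}^{N(t_0)} U_i \right) \right] \\
&= E_{N(t_0)}\left[ E_{U_1,\ldots,U_k \mid N(t_0)}\left[ \exp\left( j\omega \sum_{i=1}^{k} U_i \right) \,\Big|\, N(t_0) = k \right] \right] && \text{(see Problem 21.18)} \\
&= E_{N(t_0)}\left[ E_{U_1,\ldots,U_k}\left[ \exp\left( j\omega \sum_{i=1}^{k} U_i \right) \right] \right] && (U_i\text{'s independent of } N(t_0)) \\
&= E_{N(t_0)}\left[ E_{U_1,\ldots,U_k}\left[ \prod_{i=1}^{k} \exp(j\omega U_i) \right] \right] \\
&= E_{N(t_0)}\left[ \prod_{i=1}^{k} E_{U_i}[\exp(j\omega U_i)] \right] && (U_i\text{'s are independent}) \\
&= E_{N(t_0)}\left[ \prod_{i=1}^{k} \phi_{U_i}(\omega) \right] && \text{(definition of char. function)} \\
&= E_{N(t_0)}\left[ \phi_{U_1}^k(\omega) \right] && (U_i\text{'s identically dist.}) \\
&= \sum_{k=0}^{\infty} \phi_{U_1}^k(\omega)\, p_{N(t_0)}[k] \\
&= \sum_{k=0}^{\infty} \phi_{U_1}^k(\omega) \exp(-\lambda t_0) \frac{(\lambda t_0)^k}{k!} \\
&= \exp(-\lambda t_0) \sum_{k=0}^{\infty} \frac{(\lambda t_0 \phi_{U_1}(\omega))^k}{k!} \\
&= \exp(-\lambda t_0) \exp(\lambda t_0 \phi_{U_1}(\omega))
\end{aligned}
\]

so that finally we have the characteristic function

\[ \phi_{X(t_0)}(\omega) = \exp[\lambda t_0 (\phi_{U_1}(\omega) - 1)]. \qquad (21.12) \]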

To determine the PMF or PDF of X(t_0) we would need to take the inverse Fourier
transform of the characteristic function. As a check, if we let U_i = 1 for all i so that
from (21.11) X(t_0) = N(t_0), then since

\[ \phi_{U_1}(\omega) = E[\exp(j\omega U_1)] = \exp(j\omega) \]

we have the usual characteristic function of a Poisson random variable (see Table
6.1)

\[ \phi_{X(t_0)}(\omega) = \exp[\lambda t_0 (\exp(j\omega) - 1)]. \]

(The derivation of (21.12) can be shown to hold for this choice of the U_i's, which
are degenerate random variables.) An example follows.
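The step from the infinite sum over k to the closed form (21.12) can be checked numerically. In the Python sketch below, the function names, the parameter values and the choice of U_i equal to +1 or -1 with equal probability (so that its characteristic function is cos ω) are all assumptions made for illustration.

```python
import cmath
import math

def compound_cf_closed(omega, lam, t0, cf_u):
    """Closed form (21.12): exp[lam*t0*(phi_U(omega) - 1)]."""
    return cmath.exp(lam * t0 * (cf_u(omega) - 1))

def compound_cf_series(omega, lam, t0, cf_u, terms=200):
    """Truncated series sum_k phi_U(omega)^k * P[N(t0) = k]."""
    mean = lam * t0
    total = 0j
    weight = math.exp(-mean)  # P[N(t0) = 0]
    phi = cf_u(omega)
    for k in range(terms):
        total += phi**k * weight
        weight *= mean / (k + 1)  # P[N = k+1] obtained from P[N = k]
    return total

# Assumed example: U_i = +1 or -1 with equal probability, so phi_U = cos(omega)
cf_u = lambda w: complex(math.cos(w), 0.0)

lam, t0, omega = 2.0, 3.0, 0.7
closed = compound_cf_closed(omega, lam, t0, cf_u)
series = compound_cf_series(omega, lam, t0, cf_u)
print(abs(closed - series))
```

Passing cf_u = lambda w: cmath.exp(1j * w) instead corresponds to the degenerate case U_i = 1 and recovers the Poisson characteristic function.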

Example  21.6  -  Poisson  random  process  with  dropped  arrivals 

Consider a Poisson random process in which some of the arrivals are dropped. This
means for example that a Geiger counter may not record radioactive particles if their
intensity is too low. Assume that the probability of dropping an arrival is 1 - p,
and that this event is independent of the Poisson arrival process. Then, we wish to
determine the PMF of the number of arrivals within the time interval [0, t_0]. Thus,
the number of arrivals can be represented as

\[ X(t_0) = \sum_{i=1}^{N(t_0)} U_i \]

where U_i = 1 if the ith arrival is counted and U_i = 0 if it is dropped. Assuming that
the U_i's are IID, we have a compound Poisson random process. The characteristic
function of X(t_0) is found using (21.12) where we note that

\[
\begin{aligned}
\phi_{U_1}(\omega) &= E[\exp(j\omega U_1)] \\
&= p \exp(j\omega) + (1 - p)
\end{aligned}
\]

so that from (21.12)

\[
\begin{aligned}
\phi_{X(t_0)}(\omega) &= \exp[\lambda t_0 (p \exp(j\omega) + (1 - p) - 1)] \\
&= \exp[p \lambda t_0 (\exp(j\omega) - 1)].
\end{aligned}
\]

But this is just the characteristic function of a Poisson counting random process with
arrival rate of pλ. Hence, by dropping arrivals the arrival rate is reduced but X(t) is
still a Poisson counting process, a very reasonable result.

❖ 
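The conclusion of this example can be confirmed without characteristic functions by summing over the number of original arrivals. The Python sketch below compares that sum with a Poisson PMF of mean pλt_0; the function names, the truncation point n_max and the parameter values are illustrative assumptions.

```python
import math

def poisson_pmf(k, mean):
    """Poisson PMF with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def thinned_pmf(k, lam, t0, p, n_max=150):
    """P[X(t0) = k]: sum over n arrivals of P[N = n] * P[k of the n are kept]."""
    mean = lam * t0
    return sum(poisson_pmf(n, mean) * math.comb(n, k) * p**k * (1 - p)**(n - k)
               for n in range(k, n_max))

lam, t0, p = 2.0, 5.0, 0.3  # assumed illustrative values

# Largest discrepancy between the two PMFs over k = 0, ..., 14
max_diff = max(abs(thinned_pmf(k, lam, t0, p) - poisson_pmf(k, p * lam * t0))
               for k in range(15))
print(max_diff)
```

The printed discrepancy should be at the level of floating-point rounding, consistent with X(t_0) being Poisson with rate pλ.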

Since the characteristic function of a compound Poisson random process is available,
we can use it to easily find the moments of X(t_0). In particular, we now determine
the mean, leaving the variance as a problem (see Problem 21.22). Using (21.12) we
have

\[
\begin{aligned}
E[X(t_0)] &= \frac{1}{j} \frac{d\phi_{X(t_0)}(\omega)}{d\omega} \bigg|_{\omega = 0} && \text{(using (6.13))} \\
&= \frac{1}{j} \lambda t_0 \frac{d\phi_{U_1}(\omega)}{d\omega} \exp[\lambda t_0 (\phi_{U_1}(\omega) - 1)] \bigg|_{\omega = 0} \\
&= \lambda t_0 \frac{1}{j} \frac{d\phi_{U_1}(\omega)}{d\omega} \bigg|_{\omega = 0}
\end{aligned}
\]

since φ_{U_1}(0) = 1. But

\[ E[U_1] = \frac{1}{j} \frac{d\phi_{U_1}(\omega)}{d\omega} \bigg|_{\omega = 0} \]

so that the average value is

\[ E[X(t_0)] = \lambda t_0 E[U_1] = E[N(t_0)] E[U_1]. \qquad (21.13) \]

It is seen that the average value of X(t_0) is just the average value of U_1 times
the expected number of arrivals. This result also holds even if the U_i's only have
the same mean, without the IID assumption (see Problem 21.25 and the real-world
problem). An example follows.

Example  21.7  -  Expected  number  of  points  scored  in  basketball  game 

A basketball player, dubbed the “Poisson pistol Pete” of college basketball, shoots
the ball at an average rate of 1 shot per minute according to a Poisson random
process. He shoots a 2 point shot with a probability of 0.6 and a 3 point shot with a
probability of 0.4. If his 2 point field goal percentage is 50% and his 3 point field goal
percentage is 30%, what is his expected total number of points scored in a 40 minute
game? (We assume that the referees “let them play” so that no fouls are called and
hence no free throw points.) The average number of points is E[N(t_0)]E[U_1], where
t_0 = 2400 seconds and U_1 is a random variable that denotes his points made for the
first shot (the distribution for each shot is identical). We first determine the PMF
for U_1, where we have implicitly assumed that the U_i's are IID random variables.
From the problem description we have that

\[
U_1 =
\begin{cases}
2 & \text{if 2 point shot attempted and made} \\
3 & \text{if 3 point shot attempted and made} \\
0 & \text{otherwise.}
\end{cases}
\]

Hence,

\[
\begin{aligned}
p_{U_1}[2] &= P[\text{2 point shot attempted and made}] \\
&= P[\text{2 point shot made} \mid \text{2 point shot attempted}]\, P[\text{2 point shot attempted}] \\
&= 0.5(0.6) = 0.3
\end{aligned}
\]

and similarly p_{U_1}[3] = 0.3(0.4) = 0.12 and therefore, p_{U_1}[0] = 0.58. The expected
value becomes E[U_1] = 2(0.3) + 3(0.12) = 0.96 and therefore the expected number
of points scored is

\[
\begin{aligned}
E[N(t_0)] E[U_1] &= \lambda t_0 E[U_1] \\
&= \frac{1}{60}(2400)(0.96) \\
&= 38.4 \text{ points per game.}
\end{aligned}
\]
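The arithmetic of this example follows the two-step pattern of (21.13): first the PMF and mean of a single increment U_1, then multiplication by the expected number of arrivals. A minimal Python sketch of that pattern is given below; the dictionary names are hypothetical and chosen only for readability.

```python
# Probability of attempting each shot type and of making it once attempted
p_attempt = {2: 0.6, 3: 0.4}
p_make = {2: 0.5, 3: 0.3}

# PMF of the points U_1 scored on a single shot
pmf_u = {pts: p_make[pts] * p_attempt[pts] for pts in (2, 3)}
pmf_u[0] = 1.0 - sum(pmf_u.values())  # a missed shot scores nothing

# Mean points per shot, E[U_1]
mean_u = sum(pts * prob for pts, prob in pmf_u.items())

lam = 1 / 60   # shots per second
t0 = 2400      # game length in seconds
expected_points = lam * t0 * mean_u  # (21.13)
print(pmf_u, mean_u, expected_points)
```

Changing the shot mix or the shooting percentages only alters pmf_u; the final line applies unchanged.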


21.7  Computer  Simulation 

To generate a realization of a Poisson random process on a computer is relatively
simple. It relies on the property that the interarrival times are IID exp(λ) random
variables. We observe from Figure 21.4 that the ith interarrival time is
Z_i = T_i - T_{i-1}, where T_i is the ith arrival time. Hence,

\[ T_i = T_{i-1} + Z_i \]

where we define T_0 = 0. Each Z_i has the exp(λ) PDF and the Z_i's are IID. Hence,
to generate a realization of each Z_i we use the inverse probability integral transformation
technique (see Section 10.9) to yield

\[ Z_i = \frac{1}{\lambda} \ln \frac{1}{1 - U_i} \]

where U_i ~ U(0, 1) and the U_i's are IID. A typical realization using the following
MATLAB code is shown in Figure 21.5a for λ = 2. The arrivals are indicated now
by +'s for easier viewing. If we were to increase the arrival rate to λ = 5, then a
typical realization is shown in Figure 21.5b.


    clear all
    rand('state',0)
    lambda=2; % set arrival rate
    T=5; % set time interval in seconds
    for i=1:1000
       z(i,1)=(1/lambda)*log(1/(1-rand(1,1))); % generate interarrival times
       if i==1 % generate arrival time
          t(i,1)=z(i);
       else
          t(i,1)=t(i-1)+z(i,1);
       end
       if t(i)>T % test to see if desired time interval has elapsed
          break
       end
    end
    M=length(t)-1; % number of arrivals in interval [0,T]
    arrivals=t(1:M); % arrival times in interval [0,T]

Figure 21.5: Realizations of Poisson random process for (a) λ = 2 and (b) λ = 5.

21.8  Real-World  Example  —  Automobile  Traffic  Signal 
Planning 

An important responsibility of traffic engineers is to decide which intersections require
traffic lights. Although general guidelines are available [Federal Highway Ad.
1988], new situations constantly arise that warrant a reassessment of the situation,
principally an unusually high accident rate [Imada 2001]. In this example, we suppose
that a particular intersection, which has two stop signs, is prone to accidents.
The situation is depicted in Figure 21.6, where it is seen that the two intersecting
streets are one-way streets with a stop sign at the corner of each one. A traffic
engineer believes that the high accident rate is due to motorists who ignore the stop
signs and proceed at full speed through the intersection. If this is indeed the case,
then the installation of a traffic light is warranted. To determine if the accident rate
is consistent with his belief that motorists are “running” the stop signs, he wishes
to determine the average number of accidents that would occur if this is true. As
shown in Figure 21.6, if 2 vehicles arrive at the intersection within a given time
interval, an accident will occur. It is assumed the two cars are identical and move
with the same speed. The traffic engineer then models the arrivals as two independent
Poisson random processes, one for each direction of travel. A typical set of car
arrivals based on this assumption is shown in Figure 21.7. Specifically, an accident
will occur if any two arrivals satisfy |T^EW - T^NS| ≤ τ, where T^EW and T^NS refer
to the arrival time at the center of the intersection from the east-west direction and
the north-south direction, respectively, and τ is some minimum time for which the
cars can pass each other without colliding. The actual value of τ can be estimated
using τ = d/c, where d is the length of a car and c is its speed. As an example,
if we assume that d = 22 ft and c = 44 ft/sec (about 30 mph), then τ = 0.5 sec.
An accident will occur if two arrivals are within one-half second of each other. In
Figure 21.7 this does not occur, but there is a near miss as can be seen in Figure
21.8, which is an expanded version. The east-west car arrives at t = 2167.5 seconds
while the north-south car arrives at t = 2168.4 seconds.

Figure 21.6: Intersection with two automobiles approaching at constant speed.

Figure 21.7: Automobile arrivals.

Figure 21.8: Automobile arrivals, an expanded version of Figure 21.7 covering t = 2000 to
2500 sec. There is a near miss at t = 2168 seconds, shown by the dashed vertical line.


We now describe how to determine the average number of accidents per day.
This can be obtained by defining a set of indicator random variables (see Example
11.4) as

\[
I_i =
\begin{cases}
1 & \text{if there is at least one NS arrival with } |T_i^{EW} - T^{NS}| \le \tau \\
0 & \text{otherwise.}
\end{cases}
\]

Here T^NS can be any NS arrival time and T_i^EW is the ith arrival time for the EW
traffic. (More explicitly, the event for which the indicator random variable is 1
occurs when min_{j=1,2,...} |T_i^EW - T_j^NS| ≤ τ, where T_j^NS is the jth arrival for the NS
traffic.) Now the number of accidents in the time interval [0, t] is

\[ X(t) = \sum_{i=1}^{N(t)} I_i \qquad (21.14) \]

where N(t) is the Poisson counting random process for the EW traffic. To find
the expected value of X(t) we note that the equation (21.13), although originally
derived under the assumption that the U_i's are IID, is also valid under the weaker
assumption that the means of the U_i's are the same as shown in Problem 21.25.
Since the I_i's will be seen shortly to have the same mean, the expected value of
(21.14) is from (21.13) with U_i = I_i

\[ E[X(t)] = \lambda t E[I_i]. \qquad (21.15) \]

Now to evaluate E[I_i] we note that

\[ E[I_i] = P[|T_i^{EW} - T^{NS}| \le \tau] \]



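and the probability can be found using a conditioning approach (see (13.12)). This
produces

\[
P[|T_i^{EW} - T^{NS}| \le \tau] = \int_0^{\infty} P[|T_i^{EW} - T^{NS}| \le \tau \mid T_i^{EW} = t]\, p_{T_i}(t)\, dt.
\]

Proceeding we have that

\[
\begin{aligned}
P[|T_i^{EW} - T^{NS}| \le \tau]
&= \int_0^{\infty} P[|t - T^{NS}| \le \tau \mid T_i^{EW} = t]\, p_{T_i}(t)\, dt \\
&= \int_0^{\infty} P[|t - T^{NS}| \le \tau]\, p_{T_i}(t)\, dt && (T_i^{EW} \text{ and } T^{NS} \text{ are independent}) \\
&= \int_0^{\infty} P[t - \tau \le T^{NS} \le t + \tau]\, p_{T_i}(t)\, dt. && (21.16)
\end{aligned}
\]

Note that t - τ ≤ T^NS ≤ t + τ is the event that the NS traffic will have at least one
arrival (and hence an accident) in the interval [t - τ, t + τ]. Its probability is just

\[
\begin{aligned}
P[t - \tau \le T^{NS} \le t + \tau]
&= P[\text{one or more arrivals in } [t - \tau, t + \tau]] \\
&= 1 - P[\text{no arrivals in } [t - \tau, t + \tau]] \\
&= 1 - P[\text{no arrivals in } [0, 2\tau]] && \text{(increment stationarity)} \\
&= 1 - P[N(2\tau) = 0] \\
&= 1 - \exp(-2\lambda\tau) && \text{(from (21.2))}
\end{aligned}
\]

and is not dependent on t. Thus,

\[
\begin{aligned}
E[I_i] &= P[|T_i^{EW} - T^{NS}| \le \tau] \\
&= \int_0^{\infty} (1 - \exp(-2\lambda\tau))\, p_{T_i}(t)\, dt && \text{(from (21.16))} \\
&= 1 - \exp(-2\lambda\tau)
\end{aligned}
\]

for all i and therefore all the I_i's have the same mean. From (21.15)

\[ E[X(t)] = \lambda t (1 - \exp(-2\lambda\tau)). \]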

For the same example as before with τ = 0.5, the average number of accidents per
second is

\[ \frac{E[X(t)]}{t} = \lambda (1 - \exp(-\lambda)). \]

For a more meaningful measure we convert this to the average number of accidents
per hour, which is (E[X(t)]/t)3600. This is plotted versus λ', where λ' is in arrivals
per hour, in Figure 21.9. Specifically, it is given by

\[
\frac{E[X(t)]}{t}\, 3600 = 3600 \lambda (1 - \exp(-\lambda))
= \lambda' \left[ 1 - \exp\left( -\frac{\lambda'}{3600} \right) \right]
\]

where λ' = arrivals per hour = 3600λ.

Figure 21.9: Average number of accidents per hour versus arrival rate (in per hour
units).

As seen in Figure 21.9 for about 1 arrival
every 3 minutes or 20 arrivals per hour, we will have an average of 0.1 accidents
per hour or about an average of one accident every two days. This assumes a
busy intersection for about 5 hours per day. Thus, if the traffic engineer notices an
accident nearly every other day, he will request that a traffic light be put in.
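The accident rate formula can be tabulated directly instead of being read from Figure 21.9. The Python sketch below does this for a few arrival rates; the function name accidents_per_hour and the particular rates listed are assumptions made for illustration, and τ defaults to the 0.5 sec used above.

```python
import math

def accidents_per_hour(arrivals_per_hour, tau=0.5):
    """3600 * lam * (1 - exp(-2*lam*tau)) with lam in arrivals per second."""
    lam = arrivals_per_hour / 3600.0
    return 3600.0 * lam * (1.0 - math.exp(-2.0 * lam * tau))

for rate in (10, 20, 40, 60):
    print(rate, round(accidents_per_hour(rate), 3))
```

For 20 arrivals per hour the result is about 0.11 accidents per hour, in line with the value quoted from the figure.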
Problems

21.1 (t) Prove that the differential equation describing P_k(t) = P[N(t) = k] for a
Poisson counting random process is given by (21.3). To do so use Figure 21.3
with either k arrivals in [0, t - Δt] and no arrivals in (t - Δt, t] or k - 1 arrivals
in [0, t - Δt] and one arrival in (t - Δt, t]. Since there can be at most one
arrival in a time interval of length Δt, these are the only possibilities.

21.2 (t) Solve the differential equation of (21.3) by taking the (one-sided) Laplace
transform of both sides, noting that P_k(0^+) = 0. Explain why the latter
condition is consistent with the assumptions of a Poisson random process.
You should be able to show that the Laplace transform of P_k(t) is

\[ \mathcal{P}_k(s) = \frac{\lambda^k}{(s + \lambda)^{k+1}} \]

by finding 𝒫_1(s) from 𝒫_0(s), and then 𝒫_2(s) from 𝒫_1(s), etc. The desired inverse
Laplace transform is found by referring to a table of Laplace transforms.

21.3 (0)(f) Find the probability of 6 arrivals of a Poisson random process in the
time interval [7, 12] if λ = 1. Next determine the average number of arrivals
for the same time interval.


21.4  (w)  For  a  Poisson  random  process  with  an  arrival  rate  of  2  arrivals  per  second, 
find  the  probability  of  exactly  2  arrivals  in  5  successive  time  intervals  of  length 
1  second  each. 

21.5 (f) What is the probability of a single arrival for a Poisson random process
with arrival rate λ in the time interval [t, t + Δt] if Δt → 0?

21.6  (w)  Telephone  calls  come  into  a  service  center  at  an  average  rate  of  one  per  5 
seconds.  What  is  the  probability  that  there  will  be  more  than  12  calls  in  the 
first  one  minute? 

21.7 (0)(f,c) For a Poisson random process with an arrival rate of λ use a computer
simulation to estimate the arrival rate if λ = 2 and also if λ = 5. To do
so relate λ to the average number of arrivals in [0, t]. Hint: Use the MATLAB
code in Section 21.7.

21.8 (w) Two independent Poisson random processes both have an arrival rate of
λ. What is the expected time of the first arrival observed from either of the
two random processes? Explain your results. Hint: Let this time be denoted
by T and note that T = min(T^(1), T^(2)), where T^(i) is the first arrival time of
the ith random process. Then, note that P[T > t] = P[T^(1) > t, T^(2) > t].

21.9 (t) In this problem we prove that the sum of two independent Poisson counting
random processes is another Poisson counting random process whose arrival
rate is the sum of the arrival rates of the two random processes. Let the Poisson
counting random processes be N_1(t) and N_2(t) and consider the increments
N(t_2) - N(t_1) and N(t_4) - N(t_3) for nonoverlapping time intervals. Argue that
the corresponding increments for the sum random process are independent
and stationary, knowing that this is true for each individual random process.
Then, use characteristic functions to prove that if N_1(t) ~ Pois(λ_1 t) and
N_2(t) ~ Pois(λ_2 t) and N_1(t) and N_2(t) are independent, then N_1(t) + N_2(t) ~
Pois((λ_1 + λ_2)t).

21.10 (^) (w) If N(t) is a Poisson counting random process, determine E[N(t_2) -
N(t_1)] and var(N(t_2) - N(t_1)).

21.11 (w) Commuters arrive at a subway station that has 3 turnstiles with the
arrivals at each turnstile characterized by an independent Poisson random
process with arrival rate of λ commuters per second. Determine the probability
of a total of k arrivals in the time interval [0, t]. Hint: See Problem 21.9.

21.12 (t) In this problem we present an alternate proof that the Poisson random
process has no memory as described by (21.7). It is based on the observation
that a Poisson random process is the limiting form of a Bernoulli random
process as explained in Section 21.1. Consider first the geometric PMF of the
first success or arrival which is $P[X = k] = (1 - p)^{k-1} p$ for $k = 1, 2, \ldots$. Then
show that

\[ P[X > k_1 + k_2 \mid X > k_1] = (1 - p)^{k_2}. \]

Next let $p = \lambda \Delta t$ and $k_1 = t_1/\Delta t$ and $k_2 = t_2/\Delta t$ and prove that as $\Delta t \to 0$

\[ P[X > k_1 + k_2 \mid X > k_1] = P[X\Delta t > k_1\Delta t + k_2\Delta t \mid X\Delta t > k_1\Delta t] \to \exp(-\lambda t_2). \]

Hint: As $x \to 0$, $(1 - ax)^{1/x} \to \exp(-a)$.

21.13  (^)  (w)  Taxi  cabs  arrive  at  the  rate  of  1  per  minute  at  a  taxi  stand.  If  a 
person  has  already  waited  10  minutes  for  a  cab,  what  is  the  probability  that 
he  will  have  to  wait  less  than  1  additional  minute? 

21.14 (w) A computer memory has the capacity to store $10^6$ words. If requests
for word storage follow a Poisson random process with a request rate of 1
per millisecond, how long on average will it be before the memory capacity is
exceeded?


21.15 (t) If $X_1 \sim \exp(\lambda)$, $X_2 \sim \exp(\lambda)$, and $X_1$ and $X_2$ are independent random
variables, derive the PDF of the sum by using a convolution integral.


21.16 (t) We give an alternate derivation of the PDF for the $k$th arrival time of a
Poisson random process. This PDF can be expressed as

\[ \lim_{\Delta t \to 0} \frac{P[t - \Delta t \le T_k \le t]}{\Delta t}. \]

Use the fact that the event $\{t - \Delta t \le T_k \le t\}$ can only occur as $\Delta t \to 0$ if
there are $k - 1$ arrivals in $[0, t - \Delta t]$ and 1 arrival in $(t - \Delta t, t]$.

21.17 (^) (w) People arrive at a football game at a rate of 100 per minute. If
the 1000th person is to receive a seat at the 50th yard line (which is highly
desirable), how long should you wait before entering the stadium?

21.18 (t) Prove that if $X, Y, Z$ are jointly distributed continuous random variables,
then $E_{X,Y,Z}[g(X,Y,Z)] = E_Z[E_{X,Y|Z}[g(X,Y,Z) \mid Z]]$ by expressing the
expectations using integrals. You may wish to refer back to Section 13.6.

21.19 (t) The Poisson random process exhibits the Markov property. This says
that the conditional probability of $N(t)$ based on past samples of the random
process only depends upon the most recent sample. Mathematically, if
$t_3 > t_2 > t_1$, then

\[ P[N(t_3) = k_3 \mid N(t_2) = k_2, N(t_1) = k_1] = P[N(t_3) = k_3 \mid N(t_2) = k_2]. \]

Prove that this is true by making use of the property that the increments are
independent. Specifically, consider the equivalent probability

\[ P[N(t_3) - N(t_2) = k_3 - k_2 \mid N(t_2) - N(t_1) = k_2 - k_1, N(t_1) - N(0) = k_1] \]

and also explain why this probability is equivalent.

21.20 (o) (c) Use a computer simulation to generate multiple realizations of a
Poisson random process with $\lambda = 1$. Then, use the simulation to estimate
$P[T_2 \le 1]$. Compare your result to the true value. Hint: Use the MATLAB
code in Section 21.7.

21.21  (w)  An  airport  has  two  security  screening  lines.  An  employee  directs  the 
incoming  travelers  to  one  of  the  two  lines  at  random.  If  the  incoming  travelers 
arrive at the airport with a rate of $\lambda$ travelers per second, what is the arrival
rate  at  each  of  the  two  security  screening  lines?  What  assumptions  are  implicit 
in  arriving  at  your  answer? 

21.22 (t) Prove that the variance of a compound Poisson random process is
$\mathrm{var}(X(t_0)) = \lambda t_0 E[U_1^2]$. If you guessed that the result would be $\lambda t_0 \mathrm{var}(U_1)$,
then evaluate your guess for a Poisson random process (let $U_i = 1$).

21.23 (^) (f) A compound Poisson random process $X(t)$ is composed of random
variables $U_i$ that can take on the values $\pm 1$ with $P[U_i = 1] = p$. What is the
expected value of $X(t)$?

21.24  (c)  Perform  a  computer  simulation  to  lend  credibility  to  the  expected  number 
of  points  scored  in  the  basketball  game  described  in  Example  21.7. 

21.25 (t) Derive (21.13) for the case where the $U_i$'s have the same mean and are
independent of $N(t_0)$. Start your derivation with the expression

\[ E[X(t_0)] = E_{N(t_0)}\left[ E_{U_1,\ldots,U_k \mid N(t_0)}\left[ \sum_{i=1}^{k} U_i \,\Big|\, N(t_0) = k \right] \right] \]

and then follow the same approach as given in Section 21.6. You do not need
the characteristic function to do this.

Appendix 21A


Joint PDF for Interarrival Times

We prove in this appendix that the first two interarrival times $Z_1, Z_2$ are IID with
$Z_i \sim \exp(\lambda)$. The general case of any number of interarrival times can similarly be
proven to be IID with an $\exp(\lambda)$ PDF. We now refer to Figure 21.4 and prove that
the joint CDF factors and each marginal CDF is that corresponding to the $\exp(\lambda)$
PDF. The joint CDF is given as

\[ P[Z_1 \le \xi_1, Z_2 \le \xi_2] = \int_0^{\xi_1} P[Z_2 \le \xi_2 \mid Z_1 = z_1]\, p_{Z_1}(z_1)\, dz_1 \tag{21A.1} \]

which follows from (13.12) where $A = \{z_2 : z_2 \le \xi_2\}$. But if $Z_1 = z_1$, then $Z_2 \le \xi_2$ if
and only if $N(z_1 + \xi_2) - N(z_1) \ge 1$ since an arrival must have occurred in $[z_1, z_1 + \xi_2]$.
Hence,

\[ P[Z_2 \le \xi_2 \mid Z_1 = z_1] = P[N(z_1 + \xi_2) - N(z_1) \ge 1 \mid Z_1 = z_1] \]

and because the event $Z_1 = z_1$ is equivalent to the increment $N(z_1) - N(0) = 1$,
and the increments are independent and stationary, we have

\[
\begin{aligned}
P[Z_2 \le \xi_2 \mid Z_1 = z_1] &= P[N(z_1 + \xi_2) - N(z_1) \ge 1 \mid Z_1 = z_1] \\
&= P[N(z_1 + \xi_2) - N(z_1) \ge 1] && \text{(independence)} \\
&= P[N(\xi_2) \ge 1] && \text{(stationarity).}
\end{aligned}
\]


Using  this  in  (21A.1)  produces 
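
\[
\begin{aligned}
P[Z_1 \le \xi_1, Z_2 \le \xi_2] &= \int_0^{\xi_1} P[N(\xi_2) \ge 1]\, p_{Z_1}(z_1)\, dz_1 \\
&= \int_0^{\xi_1} \left(1 - P[N(\xi_2) < 1]\right) p_{Z_1}(z_1)\, dz_1 \\
&= \int_0^{\xi_1} \left(1 - P[N(\xi_2) = 0]\right) p_{Z_1}(z_1)\, dz_1 \\
&= \int_0^{\xi_1} \left(1 - \exp(-\lambda \xi_2)\right) p_{Z_1}(z_1)\, dz_1 \\
&= [1 - \exp(-\lambda \xi_2)] \int_0^{\xi_1} p_{Z_1}(z_1)\, dz_1 \\
&= [1 - \exp(-\lambda \xi_2)]\, P[Z_1 \le \xi_1] \\
&= [1 - \exp(-\lambda \xi_2)]\, P[N(\xi_1) \ge 1] \\
&= [1 - \exp(-\lambda \xi_2)][1 - P[N(\xi_1) < 1]] \\
&= [1 - \exp(-\lambda \xi_2)][1 - \exp(-\lambda \xi_1)] \\
&= P[Z_1 \le \xi_1]\, P[Z_2 \le \xi_2].
\end{aligned}
\]

It is seen that the joint CDF factors into the product of the marginal CDFs, where
each marginal is the CDF of an $\exp(\lambda)$ random variable. Thus, the first two
interarrival times are IID with PDF $\exp(\lambda)$.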


Chapter  22 


Markov  Chains 


22.1  Introduction 

We  have  seen  in  Chapter  16  that  an  important  random  process  is  the  IID  random 
process.  When  applicable  to  a  specific  problem,  it  lends  itself  to  a  very  simple 
analysis.  A  Bernoulli  random  process,  which  consists  of  independent  Bernoulli  trials, 
is  the  archetypical  example  of  this.  In  practice,  it  is  found,  however,  that  there  is 
usually  some  dependence  between  samples  of  a  random  process.  In  Chapters  17  and 
18  we  modeled  this  dependence  using  wide  sense  stationary  random  process  theory, 
but  restricted  the  modeling  to  only  the  first  two  moments.  In  an  effort  to  introduce  a 
more  general  dependence  into  the  modeling  of  a  random  process,  we  now  reconsider 
the  Bernoulli  random  process  but  assume  dependent  samples.  We  briefly  introduced 
this  extension  in  Example  4.10  as  a  sequence  of  dependent  Bernoulli  trials.  The 
dependence  of  the  PMF  that  we  will  be  interested  in  is  dependence  on  the  previous 
trial  only.  This  type  of  dependence  leads  to  what  is  generically  referred  to  as  a 
Markov random process. A special case of this for a discrete-time/discrete-valued
(DTDV) random process is called a Markov chain. Specifically, it has the property
that the probability of the random process $X[n]$ at time $n = n_0$ only depends upon
the outcome or realization of the random process at the previous time $n = n_0 - 1$. It
can  then  be  viewed  as  the  next  logical  step  in  extending  an  IID  random  process  to  a 
random  process  with  statistical  dependence.  Recall  from  Chapter  8  that  for  discrete 
random  variables  statistical  dependence  is  quantified  using  conditional  probabilities. 
The  reader  should  review  Example  4.10  and  also  Chapter  8  in  preparation  for  our 
discussion  of  Markov  chains. 

Although  we  will  restrict  our  description  to  a  DTDV  Markov  random  process, 
i.e.,  the  Markov  chain,  there  are  many  generalizations  that  are  important  in  practice. 
The  interested  reader  can  consult  the  excellent  books  by  [Bharucha-Reid  1988], 
[Cox  and  Miller  1965],  [Gallagher  1996]  and  [Parzen  1962]  for  these  other  random 
processes.  Before  proceeding  with  our  discussion  we  present  an  example  to  illustrate 
typical  concepts  associated  with  a  Markov  chain. 

In the game of golf it is very desirable to be a good putter. The best golfers
in  the  world  are  able  to  hit  a  golf  ball  lying  on  the  green  into  the  hole  using  only 
a  few  strokes,  called  putting  the  ball.  At  times  they  can  even  “one-putt”  the  ball, 
in  which  they  require  only  a  single  stroke  to  hit  the  ball  into  the  hole.  Of  course, 
their  chances  of  doing  so  rely  heavily  on  how  far  the  ball  is  from  the  hole  when 
they  first  reach  the  green.  If  the  ball  is  say  3  feet  from  the  hole,  then  they  will 
almost  always  one-putt.  If,  however,  it  is  near  the  edge  of  the  green,  possibly  20 
feet  from  the  hole,  then  their  chances  are  small.  For  our  hypothetical  golfer  we  will 
assume  that  her  chance  of  a  one-putt  is  50%  at  the  start  of  a  round  of  golf,  i.e.,  at 
hole  one.  If  she  one-putts  on  hole  one,  then  her  chances  on  hole  two  will  remain  at 
50%.  If  not,  she  becomes  somewhat  discouraged  which  reduces  her  chances  at  hole 
two  to  only  25%.  Hence,  at  each  hole  her  chances  of  a  one-putt  are  50%  if  she  has 
one-putted the previous hole and 25% if she has not. To model this situation we
let $X[n] = 1$ for a one-putt at hole $n$ and $X[n] = 0$ otherwise. We label hole one
by $n = 0$. A round of golf, which consists of 18 holes, produces a sequence of 18
1's and 0's with a 1 indicating a one-putt. For the probabilities assumed a typical
set of outcomes is shown in Figure 22.1. Note that she has played three rounds of


Figure  22.1:  Outcomes  of  three  rounds  of  golf.  A  1  indicates  a  one-putt  on  hole  n. 

golf  or  54  holes,  of  which  18  were  one-putts.  It  appears  that  her  probability  of  a 
one-putt  is  closer  to  1/3  than  either  1/2  or  1/4.  Also,  it  is  of  interest  to  determine 
the  average  number  of  holes  played  between  one-putts.  The  actual  number  varies 
as seen in Figure 22.1 and is {4, 1, 3, 3, 1, 1, 3, 1, 2, 3, 11, 3, 1, 3, 4, 1, 1} for an average
of 46/17 = 2.71. It would seem that the expected number of holes played between
one-putts,  about  3,  is  the  reciprocal  of  the  probability  of  a  one-putt,  about  1/3. 
This  suggests  a  geometric-type  PMF,  which  we  will  confirm  in  Section  22.6. 
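
The empirical observations above can be checked with a short simulation. The following is only a sketch in Python (the text's own simulations use MATLAB); the function names `simulate_putts` and `mean_gap_between_ones`, the seed, and the number of holes are illustrative assumptions, not taken from the text. It generates a long sequence of dependent Bernoulli trials using the 50%/25% rule and reports the fraction of one-putts and the average number of holes between one-putts.

```python
import random


def simulate_putts(num_holes, p_after_miss=0.25, p_after_hit=0.5,
                   p_first=0.5, seed=0):
    """Return a list of 0/1 outcomes (1 = one-putt), one entry per hole.

    The chance of a one-putt depends only on the previous hole's outcome.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    outcomes = [1 if rng.random() < p_first else 0]
    for _ in range(num_holes - 1):
        p_one_putt = p_after_hit if outcomes[-1] == 1 else p_after_miss
        outcomes.append(1 if rng.random() < p_one_putt else 0)
    return outcomes


def mean_gap_between_ones(outcomes):
    """Average number of holes from one one-putt to the next one-putt."""
    positions = [n for n, x in enumerate(outcomes) if x == 1]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)


outcomes = simulate_putts(200_000)
fraction = sum(outcomes) / len(outcomes)
gap = mean_gap_between_ones(outcomes)
print(f"fraction of one-putts: {fraction:.3f}")
print(f"mean holes between one-putts: {gap:.2f}")
```

With many more holes than the 54 of Figure 22.1, the two printed estimates should land near 1/3 and 3, respectively, in line with the rough values read off the figure.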

Probabilistically, we are observing a sequence of dependent Bernoulli trials. The
dependence arises (in contrast to the usual Bernoulli random process which had
independent trials) due to the probability of a one-putt at hole $n$ being dependent
upon the outcome at hole $n - 1$. We can model this dependence using conditional
probabilities to say that

\[
\begin{aligned}
P[\text{one-putt at hole } n \mid \text{no one-putt at hole } n-1] &= \tfrac{1}{4} \\
P[\text{one-putt at hole } n \mid \text{one-putt at hole } n-1] &= \tfrac{1}{2}
\end{aligned}
\]

or

\[
\begin{aligned}
P[X[n] = 1 \mid X[n-1] = 0] &= \tfrac{1}{4} \\
P[X[n] = 1 \mid X[n-1] = 1] &= \tfrac{1}{2}.
\end{aligned}
\]

Completing the conditional probability description, we have

\[
\begin{aligned}
P[X[n] = 0 \mid X[n-1] = 0] &= \tfrac{3}{4} \\
P[X[n] = 1 \mid X[n-1] = 0] &= \tfrac{1}{4} \\
P[X[n] = 0 \mid X[n-1] = 1] &= \tfrac{1}{2} \\
P[X[n] = 1 \mid X[n-1] = 1] &= \tfrac{1}{2}.
\end{aligned}
\]
Note  that  we  have  assumed  that  the  conditional  probabilities  do  not  change  with 
“time”  (actually  hole  number).  Lastly,  we  require  the  initial  probability  of  a  one- 
putt  for  the  first  hole.  We  assign  this  to  be  P[X[0]  =  1]  =  1/2.  In  summary,  we 
have  two  sets  of  conditional  probabilities  and  one  set  of  initial  probabilities  which 
can  be  arranged  conveniently  using  a  matrix  and  vector  to  be 

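\[
\mathbf{P} =
\begin{bmatrix}
P[X[n] = 0 \mid X[n-1] = 0] & P[X[n] = 1 \mid X[n-1] = 0] \\
P[X[n] = 0 \mid X[n-1] = 1] & P[X[n] = 1 \mid X[n-1] = 1]
\end{bmatrix}
=
\begin{bmatrix}
\tfrac{3}{4} & \tfrac{1}{4} \\
\tfrac{1}{2} & \tfrac{1}{2}
\end{bmatrix}
\tag{22.1}
\]

and

\[
\mathbf{p}[0] =
\begin{bmatrix}
P[X[0] = 0] \\
P[X[0] = 1]
\end{bmatrix}
=
\begin{bmatrix}
\tfrac{1}{2} \\
\tfrac{1}{2}
\end{bmatrix}.
\tag{22.2}
\]
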

Figure  22.2:  Markov  state  probability  diagram  for  putting  example. 

The  probabilities  can  also  be  summarized  using  the  diagram  shown  in  Figure  22.2, 
where  for  example  the  conditional  probability  of  a  one-putt  on  hole  n  given  that  the 
golfer has not one-putted on hole $n - 1$ is 1/4. We may view this diagram as one
in  which  we  are  in  “state”  0,  which  corresponds  to  the  previous  outcome  of  no  one- 
putt  and  will  move  to  “state”  1,  which  corresponds  to  a  one-putt,  with  a  conditional 
probability  of  1/4.  If  we  do  move  to  a  new  state,  it  means  the  outcome  is  a  1  and 
otherwise,  the  outcome  is  a  0.  In  interpreting  the  diagram  one  should  visualize  that 
a  0  or  1  is  emitted  as  we  enter  the  0  or  1  state,  respectively.  Then,  the  current  state 
becomes  the  last  value  emitted.  Also,  our  initial  unconditional  probabilities  of  1/2 
and  1/2  of  entering  state  0  or  state  1  are  shown  as  dashed  lines.  The  diagram  is  called 
the  Markov  state  probability  diagram.  The  use  of  the  term  “state”  is  derived  from 
physics  in  that  the  future  evolution  (in  terms  of  probabilities)  of  the  process  is  only 
dependent  upon  the  current  state  and  not  upon  how  the  process  arrived  in  that  state. 
The  probabilistic  structure  summarized  in  Figure  22.2  is  called  a  Markov  chain. 
As  mentioned  previously,  it  is  a  DTDV  random  process.  Although  we  have  used  a 
dependent  Bernoulli  random  process  as  an  example,  it  easily  generalizes  to  any  finite 
number  of  states.  It  is  common  in  the  discussion  of  Markov  chains  to  term  the  matrix 
of  conditional  probabilities  P  in  (22.1)  as  the  state  transition  probability  matrix  or 
more  succinctly  the  transition  probability  matrix.  The  initial  probability  vector  p[0] 
in  (22.2)  is  called  the  initial  state  probability  vector  or  more  succinctly  the  initial 
probability  vector.  Note  that  in  using  the  state  probability  diagram  to  summarize 
the  Markov  chain  we  will  henceforth  omit  the  initial  probability  assignment  in  the 
diagram  but  it  should  be  kept  in  mind  that  it  is  necessary  in  order  to  complete  the 
description. 

As an example of a typical probability computation, consider the probability of
$X[0] = 0, X[1] = 1, X[2] = 1$ versus $X[0] = 1, X[1] = 1, X[2] = 1$. Then, using the
chain rule (see (4.10)) we have

\[
P[X[0] = 0, X[1] = 1, X[2] = 1] = P[X[2] = 1 \mid X[1] = 1, X[0] = 0]
\cdot P[X[1] = 1 \mid X[0] = 0]\, P[X[0] = 0].
\]

But due to the assumption that the probability of $X[n]$ only depends upon the
outcome at time $n - 1$, which is called the Markov property, we have

\[ P[X[2] = 1 \mid X[1] = 1, X[0] = 0] = P[X[2] = 1 \mid X[1] = 1] \]

and therefore

\[
P[X[0] = 0, X[1] = 1, X[2] = 1] = P[X[2] = 1 \mid X[1] = 1]\, P[X[1] = 1 \mid X[0] = 0]
\cdot P[X[0] = 0].
\]

But from Figure 22.2 this is

\[
P[X[0] = 0, X[1] = 1, X[2] = 1] = \left(\tfrac{1}{2}\right)\left(\tfrac{1}{4}\right)\left(\tfrac{1}{2}\right) = \tfrac{1}{16}.
\]

Similarly,

\[
P[X[0] = 1, X[1] = 1, X[2] = 1] = P[X[2] = 1 \mid X[1] = 1]\, P[X[1] = 1 \mid X[0] = 1]
\cdot P[X[0] = 1] = \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right) = \tfrac{1}{8}.
\]

We  see  that  joint  probabilities  are  easily  determined  from  the  initial  probabilities 
and  the  transition  probabilities.  If  we  are  only  interested  in  the  marginal  PMF  at  a 
given time, say $P[X[n] = k]$ for $k = 0, 1$, as, for example, $P[X[2] = 1]$, we need only
sum over the other variables of the joint PMF. This produces

\[
\begin{aligned}
P[X[2] = 1] &= \sum_{i=0}^{1} \sum_{j=0}^{1} P[X[0] = i, X[1] = j, X[2] = 1] \\
&= \sum_{i=0}^{1} \sum_{j=0}^{1} P[X[2] = 1 \mid X[0] = i, X[1] = j]\, P[X[1] = j \mid X[0] = i]\, P[X[0] = i] \\
&= \sum_{i=0}^{1} \sum_{j=0}^{1} P[X[2] = 1 \mid X[1] = j]\, P[X[1] = j \mid X[0] = i]\, P[X[0] = i]
\qquad \text{(Markov property)} \\
&= \sum_{j=0}^{1} P[X[2] = 1 \mid X[1] = j]
\underbrace{\sum_{i=0}^{1} P[X[1] = j \mid X[0] = i]\, P[X[0] = i]}_{P[X[1] = j]}.
\end{aligned}
\]


Note  that  P[X[1]  =  j]  can  be  found  and  then  used  to  find  P[X[2]  =  1].  Of  course, 
this  is  getting  somewhat  messy  algebraically  but  as  shown  in  the  next  section  the 
use  of  vectors  and  matrices  will  simplify  the  computation. 
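
The joint and marginal computations above can be reproduced by brute force. The sketch below is in Python with exact fractions; the helper names `sequence_probability` and `marginal` are illustrative assumptions, not notation from the text. It multiplies an initial probability by one transition probability per step, then sums the joint PMF over all earlier outcomes to obtain a marginal.

```python
from fractions import Fraction as F
from itertools import product

# Transition probabilities P[i][j] = P[X[n] = j | X[n-1] = i] from (22.1)
P = [[F(3, 4), F(1, 4)],
     [F(1, 2), F(1, 2)]]
# Initial probabilities p0[i] = P[X[0] = i] from (22.2)
p0 = [F(1, 2), F(1, 2)]


def sequence_probability(seq, P, p0):
    """Joint probability of X[0], X[1], ... taking the values in seq."""
    prob = p0[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= P[prev][cur]  # Markov property: only the previous outcome matters
    return prob


def marginal(n, k, P, p0):
    """P[X[n] = k], found by summing the joint PMF over X[0], ..., X[n-1]."""
    states = range(len(p0))
    return sum(sequence_probability(path + (k,), P, p0)
               for path in product(states, repeat=n))


print(sequence_probability((0, 1, 1), P, p0))  # 1/16
print(sequence_probability((1, 1, 1), P, p0))  # 1/8
print(marginal(2, 1, P, p0))                   # 11/32
```

The brute-force sum grows as $2^n$ terms, which is one way to see why a more compact computation is desirable.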

Finally,  some  questions  of  interest  to  the  golfer  are: 

1. After playing many holes, will the probability of a one-putt settle down to some
constant value? Mathematically, will $P[X[n] = k]$ converge to some constant
PMF as $n \to \infty$?

2. Given that the golfer has just one-putted, how many holes on the average will she
have to wait until the next one-putt? Or given that she has not one-putted,
how many holes on the average will she have to wait until she one-putts? In
the first case, mathematically we wish to determine if given $X[n_0] = 1$ and
$X[n_0 + 1] = 0, \ldots, X[n_0 + N - 1] = 0, X[n_0 + N] = 1$, what is $E[N]$?

We  will  answer  both  these  questions  shortly,  but  before  doing  so  some  definitions 
are  necessary. 

22.2  Summary 

A motivating example of a Markov chain is given in Section 22.1. A Markov chain is
defined by the property of (22.3). The state transition probabilities, which describe
the probabilities of movements between states, are given by (22.4). When arranged
in a matrix they form the transition probability matrix, which is given by (22.5)
for a two-state Markov chain. The probabilities of the states are defined in
(22.6) and succinctly summarized by the vector of (22.7) for a two-state Markov
chain. Table 22.1 summarizes the notational conventions. The state probability
vector can be found for any time by using (22.9). To evaluate a power of the
transition probability matrix, (22.12) can be used if the eigenvalues of the matrix
are distinct. For a two-state Markov chain the state probabilities are explicitly found
in  Section  22.4  with  the  general  transition  probability  matrix  given  by  (22.14).  For 
an  ergodic  Markov  chain  the  state  probabilities  approach  a  constant  value  as  time 
increases  and  this  value  is  found  by  solving  (22.17).  Also,  the  value  of  the  n-step 
transition  probability  matrix  approaches  the  steady-state  value  given  by  (22.19).  In 
Section  22.6  the  occupation  time  of  a  state  for  an  ergodic  Markov  chain  is  shown 
to  be  given  by  the  steady-state  probabilities  and  also,  the  mean  recurrence  time 
is  the  inverse  of  the  occupation  time.  An  explicit  solution  for  the  steady-state 
or  stationary  probabilities  can  be  found  using  (22.22).  The  MATLAB  code  for 
a  computer  simulation  of  a  3-state  Markov  chain  is  given  in  Section  22.8  while  a 
concluding  real-world  example  is  given  in  Section  22.9. 


22.3  Definitions 

We restrict ourselves to a discrete-time random process $X[n]$ with $K$ possible values
or states. In the introduction $K = 2$ and the values were 0, 1. This is a DTDV
random process that starts at $n = 0$ (semi-infinite). We define $X[n]$ as a Markov
chain if given the entire past set of outcomes, the PMF of $X[n]$ depends on only the
outcome of the previous sample $X[n - 1]$ so that

\[ P[X[n] = j \mid X[n-1] = i, X[n-2] = k, \ldots, X[0] = l] = P[X[n] = j \mid X[n-1] = i]. \]

Using the concept of a PMF this is equivalent to

\[ p_{X[n] \mid X[n-1], \ldots, X[0]} = p_{X[n] \mid X[n-1]}. \tag{22.3} \]

This implies that the joint PMF only depends on the product of the first-order
conditional PMFs and the initial probabilities, for example

\[
\begin{aligned}
p_{X[0], X[1], X[2]} &= p_{X[2] \mid X[1], X[0]}\; p_{X[1] \mid X[0]}\; p_{X[0]} \\
&= \underbrace{p_{X[2] \mid X[1]}}_{\substack{\text{conditional} \\ \text{probability}}}
\; \underbrace{p_{X[1] \mid X[0]}}_{\substack{\text{conditional} \\ \text{probability}}}
\; \underbrace{p_{X[0]}}_{\substack{\text{initial} \\ \text{probability}}}.
\end{aligned}
\]

As  mentioned  previously,  this  is  an  extension  of  the  idea  of  independence  in  that 
it  asserts  a  type  of  conditional  independence.  Most  importantly,  the  joint  PMF  is 
obtained  as  the  product  of  first-order  conditional  PMFs.  An  example  follows. 

Example  22.1  —  A  coin  with  memory 

Assume  that  a  coin  is  tossed  three  times  with  the  outcome  of  a  head  represented  by 
a  1  and  a  tail  by  a  0.  If  the  coin  has  memory  and  is  modeled  by  the  state  probability 
diagram  of  Figure  22.2,  determine  the  probability  of  the  sequence  HTH.  Note  that 
the  conditional  probabilities  are  equivalent  to  those  in  Example  4.10.  Writing  the 
joint  probability  in  the  more  natural  order  of  increasing  time,  we  have 

\[
\begin{aligned}
P[X[0] = 1, X[1] = 0, X[2] = 1] &= P[X[0] = 1]\, P[X[1] = 0 \mid X[0] = 1]
\cdot P[X[2] = 1 \mid X[1] = 0] \\
&= \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(\tfrac{1}{4}\right) = \tfrac{1}{16}.
\end{aligned}
\]

Hence, the sequence HTH is less probable than for a fair coin without memory for
which 3 independent tosses would yield a probability of 1/8. Can you explain why
this is less probable?

◊

We will now use the terminology of the introduction to refer to the conditional
probabilities $P[X[n] = j \mid X[n-1] = i]$ as the state transition probabilities. Note
that they are assumed not to depend on $n$ and therefore the Markov chain is said
to be homogeneous. To simplify the notation further and to prepare for subsequent
probability calculations we denote the state transition probabilities as

\[ P_{ij} = P[X[n] = j \mid X[n-1] = i] \qquad i = 0, 1, \ldots, K-1;\; j = 0, 1, \ldots, K-1. \tag{22.4} \]

This is the conditional probability of observing an outcome $j$ given that the previous
outcome was $i$. It is also said that $P_{ij}$ is the probability of the chain moving from




state $i$ to state $j$, but keep in mind that it is a conditional probability. In the case
of a two-state Markov chain or $K = 2$, we have $i = 0, 1$; $j = 0, 1$ and the state
transition probabilities are most conveniently arranged in a matrix $\mathbf{P}$. From (22.1)
we have

\[
\mathbf{P} =
\begin{bmatrix}
P_{00} & P_{01} \\
P_{10} & P_{11}
\end{bmatrix}
\tag{22.5}
\]

which  as  previously  mentioned  is  the  transition  probability  matrix.  Note  that  the 
sum  of  the  elements  along  each  row  must  be  one  since  they  represent  all  the  values 
of  a  conditional  PMF.  In  accordance  with  the  assumption  of  homogeneity  P  is  a 
constant  matrix.  Finally,  we  define  the  state  probabilities  at  time  n  as 


\[ p_i[n] = P[X[n] = i] \qquad i = 0, 1, \ldots, K-1. \tag{22.6} \]

This is the probability of observing an outcome $i$ at time $n$ or equivalently the PMF
of $X[n]$. This notation is somewhat at odds with our previous notation, which would
be $p_{X[n]}[i]$, but is a standard one. The PMF depends on $n$ and it is this PMF with
which we will be most concerned. In particular, how the PMF changes with $n$ will be of
interest. Hence, a Markov chain is in general a nonstationary random process. For
ease of notation and later computation we also define the state probability vector for
$K = 2$ as

\[
\mathbf{p}[n] =
\begin{bmatrix}
p_0[n] \\
p_1[n]
\end{bmatrix}.
\tag{22.7}
\]


A summary of these definitions and notation is given in Table 22.1. An example
is given next to illustrate the utility of definitions (22.4) and (22.6) and their
vector/matrix representations of (22.5) and (22.7).


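Example 22.2 - Two-state Markov chain

Consider the computation of $P[X[n] = j]$ for a two-state Markov chain ($K = 2$).
Then,

\[
\begin{aligned}
P[X[n] = j] &= \sum_{i=0}^{1} P[X[n-1] = i, X[n] = j] \\
&= \sum_{i=0}^{1} P[X[n] = j \mid X[n-1] = i]\, P[X[n-1] = i]
\end{aligned}
\]

which can now be written as

\[ p_j[n] = \sum_{i=0}^{1} P_{ij}\, p_i[n-1] \qquad j = 0, 1. \]

In vector/matrix notation we have

\[
\underbrace{\begin{bmatrix} p_0[n] & p_1[n] \end{bmatrix}}_{\mathbf{p}^T[n]}
=
\underbrace{\begin{bmatrix} p_0[n-1] & p_1[n-1] \end{bmatrix}}_{\mathbf{p}^T[n-1]}
\underbrace{\begin{bmatrix} P_{00} & P_{01} \\ P_{10} & P_{11} \end{bmatrix}}_{\mathbf{P}}
\]
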
Terminology                          Description        Notation
-----------------------------------  -----------------  ---------------------------------------------
Random process                       DTDV               $X[n]$, $n = 0, 1, \ldots$
State                                Sample space       $k = 0, 1, \ldots, K-1$
State probability vector             PMF of $X[n]$      $\mathbf{p}[n] = [p_0[n] \; \ldots \; p_{K-1}[n]]^T$,
                                                        $p_k[n] = P[X[n] = k]$
State transition probability matrix  Conditional prob.  $\mathbf{P} = \begin{bmatrix} P_{00} & P_{01} & \cdots & P_{0,K-1} \\ P_{10} & P_{11} & \cdots & P_{1,K-1} \\ \vdots & \vdots & \ddots & \vdots \\ P_{K-1,0} & P_{K-1,1} & \cdots & P_{K-1,K-1} \end{bmatrix}$,
                                                        $P_{ij} = P[X[n] = j \mid X[n-1] = i]$
Initial state probability vector     PMF of $X[0]$      $\mathbf{p}[0]$

Table 22.1: Markov chain definitions and notation.


or

\[ \mathbf{p}^T[n] = \mathbf{p}^T[n-1]\,\mathbf{P}. \tag{22.8} \]

The evolution of the state probability vector in time is easily found by post-multiplying
the previous state probability vector (in row form) by the transition probability
matrix.

◊

Note  that  we  have  defined  p[n]  as  a  column  vector  in  accordance  with  our  usual 
convention.  Other  textbooks  may  use  row  vectors.  A  numerical  example  follows. 

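Example 22.3 - Golfer one-putting

From Figure 22.2 we have the transition probability matrix and initial state
probability vector as

\[
\mathbf{P} =
\begin{bmatrix}
\tfrac{3}{4} & \tfrac{1}{4} \\
\tfrac{1}{2} & \tfrac{1}{2}
\end{bmatrix}
\qquad
\mathbf{p}[0] =
\begin{bmatrix}
\tfrac{1}{2} \\
\tfrac{1}{2}
\end{bmatrix}.
\]

To find $\mathbf{p}[1]$ we use (22.8) to yield

\[
\mathbf{p}^T[1] = \mathbf{p}^T[0]\,\mathbf{P}
= \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}
\begin{bmatrix}
\tfrac{3}{4} & \tfrac{1}{4} \\
\tfrac{1}{2} & \tfrac{1}{2}
\end{bmatrix}
= \begin{bmatrix} \tfrac{5}{8} & \tfrac{3}{8} \end{bmatrix}.
\]

As expected the elements of $\mathbf{p}[1]$ sum to one. Also, note that $p_1[1] = 3/8 \ne 1/2$,
which means that initially the probability of a one-putt is 1/2 but after the first
hole, it is reduced to 3/8. Can you explain why? We can continue in this manner
to compute the state probability vector for $n = 2$ as

\[
\begin{aligned}
\mathbf{p}^T[2] &= \mathbf{p}^T[1]\,\mathbf{P} \\
&= \begin{bmatrix} \tfrac{5}{8} & \tfrac{3}{8} \end{bmatrix}
\begin{bmatrix}
\tfrac{3}{4} & \tfrac{1}{4} \\
\tfrac{1}{2} & \tfrac{1}{2}
\end{bmatrix} \\
&= \begin{bmatrix} \tfrac{21}{32} & \tfrac{11}{32} \end{bmatrix}
\end{aligned}
\]

and so forth for all $n$.

◊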
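
The recursion of (22.8) used in this example is easy to iterate by machine. The following Python sketch uses exact fractions; the function name `step` and the choice of eight iterations are illustrative assumptions. Each pass multiplies the current state probability vector, held as a row, by the transition probability matrix.

```python
from fractions import Fraction as F


def step(p, P):
    """One application of p^T[n] = p^T[n-1] P (row vector times matrix)."""
    K = len(p)
    return [sum(p[i] * P[i][j] for i in range(K)) for j in range(K)]


P = [[F(3, 4), F(1, 4)],
     [F(1, 2), F(1, 2)]]
p = [F(1, 2), F(1, 2)]   # initial state probability vector p[0]

history = [p]            # history[n] holds p[n]
for n in range(1, 9):
    p = step(p, P)
    history.append(p)
    print(n, [str(x) for x in p], round(float(p[1]), 6))
```

The first two printed rows reproduce $[5/8 \;\; 3/8]$ and $[21/32 \;\; 11/32]$. In this run the one-putt probability $p_1[n]$ keeps decreasing toward a value near 1/3, which matches the rough estimate read from Figure 22.1.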


22.4  Computation  of  State  Probabilities 

We are now in a position to determine $\mathbf{p}[n]$ for all $n$. The key of course is the
recursion of (22.8). In a slightly more general form where we wish to go from $\mathbf{p}[n_1]$
to $\mathbf{p}[n_2]$, the resulting equations are known as the Chapman-Kolmogorov equations.
For example, if $n_2 = n_1 + 2$, then

\[
\begin{aligned}
\mathbf{p}^T[n_2] &= \mathbf{p}^T[n_2 - 1]\,\mathbf{P} \\
&= (\mathbf{p}^T[n_2 - 2]\,\mathbf{P})\,\mathbf{P} \\
&= \mathbf{p}^T[n_1]\,\mathbf{P}^2.
\end{aligned}
\]

The matrix $\mathbf{P}^2$ is known as the two-step transition probability matrix. It allows the
state probabilities for two steps into the future to be found if we know the state
probabilities at the current time. In general, then we see that

\[ \mathbf{p}^T[n_1 + n] = \mathbf{p}^T[n_1]\,\mathbf{P}^n \]

as is easily verified, where $\mathbf{P}^n$ is the $n$-step transition probability matrix. In
particular, if $n_1 = 0$, then

\[ \mathbf{p}^T[n] = \mathbf{p}^T[0]\,\mathbf{P}^n \qquad n = 1, 2, \ldots \tag{22.9} \]

which can be used to find the state probabilities for all time. These probabilities can
exhibit markedly different behaviors depending upon the entries in $\mathbf{P}$. To illustrate
this consider the two-state Markov chain with the state probability diagram shown
in Figure 22.3.

Figure 22.3: General two-state probability diagram.

This corresponds to the transition probability matrix

\[
\mathbf{P} =
\begin{bmatrix}
1 - \alpha & \alpha \\
\beta & 1 - \beta
\end{bmatrix}
\tag{22.10}
\]

where $0 \le \alpha \le 1$ and $0 \le \beta \le 1$. As always the rows sum to one. We give an
example and then generalize the results.

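Example 22.4 - State probability vector computation for all n

Let $\alpha = \beta = 1/2$ and $\mathbf{p}^T[0] = [\,1 \;\; 0\,]$ so that we are initially in state 0 and the
transition to either of the states is equally probable. Then from (22.9) we have

\[
\mathbf{p}^T[n] = \mathbf{p}^T[0]\,\mathbf{P}^n
= \begin{bmatrix} 1 & 0 \end{bmatrix}
\begin{bmatrix}
\tfrac{1}{2} & \tfrac{1}{2} \\
\tfrac{1}{2} & \tfrac{1}{2}
\end{bmatrix}^n
\]

and since $\mathbf{P}^2 = \mathbf{P}$ for this matrix, $\mathbf{P}^n = \mathbf{P}$ for all $n \ge 1$.
Clearly, $\mathbf{p}^T[n] = [\,\tfrac{1}{2} \;\; \tfrac{1}{2}\,]$ for all $n \ge 1$. The Markov chain is said to be in
steady-state for $n \ge 1$. In addition, for $n \ge 1$, the PMF $\mathbf{p}^T[n] = [\,\tfrac{1}{2} \;\; \tfrac{1}{2}\,]$ is called the
steady-state PMF.

◊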
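
As a numerical companion to (22.9) and this example, the sketch below forms the $n$-step transition probability matrix by repeated multiplication. It is written in Python with exact fractions; the helpers `matmul`, `matpow`, and `state_probs` are illustrative names, not part of the text. It also checks one instance of the Chapman-Kolmogorov idea, namely that a five-step matrix equals a two-step matrix times a three-step matrix.

```python
from fractions import Fraction as F


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matpow(P, n):
    """n-step transition probability matrix P^n (P^0 is the identity)."""
    K = len(P)
    result = [[F(int(i == j)) for j in range(K)] for i in range(K)]
    for _ in range(n):
        result = matmul(result, P)
    return result


def state_probs(p0, P, n):
    """State probabilities p^T[n] = p^T[0] P^n, returned as a list."""
    return matmul([p0], matpow(P, n))[0]


# Example 22.4: alpha = beta = 1/2, starting in state 0
P = [[F(1, 2), F(1, 2)],
     [F(1, 2), F(1, 2)]]
p0 = [F(1), F(0)]

for n in range(1, 5):
    print(n, [str(x) for x in state_probs(p0, P, n)])

# Five steps are the same as two steps followed by three steps
assert matpow(P, 5) == matmul(matpow(P, 2), matpow(P, 3))
```

Every printed row is `['1/2', '1/2']`, the steady-state PMF of the example. Repeated multiplication costs one matrix product per step, which is the motivation for seeking a closed form for the matrix power.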

More  generally,  the  state  probabilities  of  a  Markov  chain  may  or  may  not  approach 
a  steady-state  value.  It  depends  upon  the  form  of  P.  To  study  the  behavior  more 
thoroughly we require a means of determining $\mathbf{P}^n$. To do so we next review the
diagonalization  of  a  matrix  using  an  eigenanalysis  (see  also  Appendix  D). 
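
Computing Powers of P


Assuming that the eigenvalues of $\mathbf{P}$ are distinct, it is possible to find eigenvectors
$\mathbf{v}_i$ that are linearly independent. Arranging them as the columns of a matrix and
assuming that $K = 2$, we have the modal matrix $\mathbf{V} = [\,\mathbf{v}_1 \;\; \mathbf{v}_2\,]$, which is a nonsingular
matrix since the eigenvectors are linearly independent. Then we can write that

\[ \mathbf{V}^{-1}\mathbf{P}\mathbf{V} = \boldsymbol{\Lambda} \tag{22.11} \]

where $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2)$ and $\lambda_i$ is the eigenvalue corresponding to the $i$th eigenvector
of $\mathbf{P}$. Now from (22.11) we have that $\mathbf{P} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}$ and therefore, the powers of $\mathbf{P}$
can be found as follows.

\[
\begin{aligned}
\mathbf{P}^2 &= (\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1})(\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}) = \mathbf{V}\boldsymbol{\Lambda}^2\mathbf{V}^{-1} \\
\mathbf{P}^3 &= \mathbf{P}^2\mathbf{P} = (\mathbf{V}\boldsymbol{\Lambda}^2\mathbf{V}^{-1})\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1} = \mathbf{V}\boldsymbol{\Lambda}^3\mathbf{V}^{-1}
\end{aligned}
\]

and in general we have that

\[ \mathbf{P}^n = \mathbf{V}\boldsymbol{\Lambda}^n\mathbf{V}^{-1}. \tag{22.12} \]

But since $\boldsymbol{\Lambda}$ is a diagonal matrix its powers are easily found as

\[
\boldsymbol{\Lambda}^n =
\begin{bmatrix}
\lambda_1^n & 0 \\
0 & \lambda_2^n
\end{bmatrix}
\]

and finally we have that

\[
\mathbf{P}^n = \mathbf{V}
\begin{bmatrix}
\lambda_1^n & 0 \\
0 & \lambda_2^n
\end{bmatrix}
\mathbf{V}^{-1}.
\tag{22.13}
\]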

It  should  be  observed  that  the  eigenvectors  need  not  be  normalized  to  unity  length 
for  (22.13)  to  hold.  As  an  example,  if 


il 


then  the  eigenvalues  are  found  from  the  characteristic  equation  as  the  solutions  of 
det(P  —  AI)  =  0.  This  yields  the  equation  (1/2  —  A)(l  —  A)  =  0  which  produces 
Ai  —  1/2  and  A2  =  1.  The  eigenvectors  are  found  from 


"  0  b  ' 

'  1  ' 

(P  -  Ail)vi  = 

2 

,°  i. 

Vl  =  0  =r-  Vl  = 

0 

r  1  1  1 

'  1  ‘ 

(P  -  A2I)v2  = 

2  2 

°  0 

v2  =  0  =>  v2  = 

1 
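and hence the modal matrix and its inverse are

$$\mathbf{V} = [\mathbf{v}_1 \; \mathbf{v}_2] = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \qquad \mathbf{V}^{-1} = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix}.$$

Finally, for $n \geq 1$ we can easily find the powers of $\mathbf{P}$ from (22.13) as

$$\mathbf{P}^n = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \left(\frac{1}{2}\right)^n & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} \left(\frac{1}{2}\right)^n & 1 - \left(\frac{1}{2}\right)^n \\ 0 & 1 \end{bmatrix}.$$

This can easily be verified by direct multiplication of $\mathbf{P}$.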
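The verification can also be done numerically. The following MATLAB sketch is an illustration rather than part of the original development: it diagonalizes the example matrix with `eig` and compares the eigenanalysis form of (22.13) and the closed form with a direct matrix power. The choice n = 5 and all variable names are assumptions made for the example.

```matlab
% Sketch: check P^n = V*Lambda^n*inv(V) for the example matrix
P = [1/2 1/2; 0 1];                    % example transition probability matrix
n = 5;                                 % power to evaluate (arbitrary choice)
[V, Lambda] = eig(P);                  % columns of V are eigenvectors of P
Pn_eig = V*diag(diag(Lambda).^n)/V;    % V*Lambda^n*inv(V), as in (22.13)
Pn_direct = P^n;                       % direct multiplication of P
Pn_closed = [(1/2)^n 1-(1/2)^n; 0 1];  % closed form for this example
err_eig = max(abs(Pn_eig(:) - Pn_direct(:)));
err_closed = max(abs(Pn_closed(:) - Pn_direct(:)));
```

Note that `eig` returns unit-length eigenvectors, whereas the example uses unnormalized ones; both give the same powers, in line with the remark that normalization is not required.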


Now returning to the problem of determining the state probability vector for
the general two-state Markov chain, we need to first find the eigenvalues of (22.10).
The characteristic equation is

$$\det(\mathbf{P} - \lambda\mathbf{I}) = \det \begin{bmatrix} 1 - \alpha - \lambda & \alpha \\ \beta & 1 - \beta - \lambda \end{bmatrix} = 0$$

which produces $(1 - \alpha - \lambda)(1 - \beta - \lambda) - \alpha\beta = 0$ or

$$\lambda^2 + (\alpha + \beta - 2)\lambda + (1 - \alpha - \beta) = 0.$$

Letting $r = \alpha + \beta$, which is nonnegative, we have that $\lambda^2 + (r - 2)\lambda + (1 - r) = 0$
for which the solution is

$$\lambda = \frac{-(r - 2) \pm \sqrt{(r - 2)^2 - 4(1 - r)}}{2} = \frac{-(r - 2) \pm r}{2} = 1 \text{ and } 1 - r.$$

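Thus, the eigenvalues are $\lambda_1 = 1$ and $\lambda_2 = 1 - \alpha - \beta$. Next we determine the
corresponding eigenvectors as

$$(\mathbf{P} - \lambda_1\mathbf{I})\mathbf{v}_1 = \begin{bmatrix} -\alpha & \alpha \\ \beta & -\beta \end{bmatrix} \mathbf{v}_1 = \mathbf{0} \;\Rightarrow\; \mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

$$(\mathbf{P} - \lambda_2\mathbf{I})\mathbf{v}_2 = \begin{bmatrix} \beta & \alpha \\ \beta & \alpha \end{bmatrix} \mathbf{v}_2 = \mathbf{0} \;\Rightarrow\; \mathbf{v}_2 = \begin{bmatrix} 1 \\ -\frac{\beta}{\alpha} \end{bmatrix}$$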
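and therefore the modal matrix and its inverse are

$$\mathbf{V} = \begin{bmatrix} 1 & 1 \\ 1 & -\beta/\alpha \end{bmatrix} \qquad \mathbf{V}^{-1} = \frac{1}{1 + \beta/\alpha} \begin{bmatrix} \beta/\alpha & 1 \\ 1 & -1 \end{bmatrix}.$$

With the matrix

$$\boldsymbol{\Lambda} = \begin{bmatrix} 1 & 0 \\ 0 & 1 - \alpha - \beta \end{bmatrix}$$

we have

$$\mathbf{P}^n = \mathbf{V}\boldsymbol{\Lambda}^n\mathbf{V}^{-1} = \begin{bmatrix} 1 & 1 \\ 1 & -\beta/\alpha \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & (1 - \alpha - \beta)^n \end{bmatrix} \frac{1}{1 + \beta/\alpha} \begin{bmatrix} \beta/\alpha & 1 \\ 1 & -1 \end{bmatrix}$$

and after some algebra

$$\mathbf{P}^n = \begin{bmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \\ \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{bmatrix} + (1 - \alpha - \beta)^n \begin{bmatrix} \frac{\alpha}{\alpha+\beta} & -\frac{\alpha}{\alpha+\beta} \\ -\frac{\beta}{\alpha+\beta} & \frac{\beta}{\alpha+\beta} \end{bmatrix}. \qquad (22.14)$$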
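The algebra leading to (22.14) is easy to get wrong, so a numerical check is worthwhile. The MATLAB sketch below is an illustration only. It assumes that (22.14) has the form of a constant matrix with identical rows [β/(α+β) α/(α+β)] plus (1 − α − β)^n times a transient matrix, and compares that form with direct powers of P. The values of α and β and the variable names are arbitrary choices.

```matlab
% Sketch: compare the closed form (22.14) with direct powers of P
alpha = 0.25; beta = 0.5;                     % illustrative values, 0 < alpha+beta < 2
P = [1-alpha alpha; beta 1-beta];             % two-state matrix in the form of (22.10)
A = [beta alpha; beta alpha]/(alpha+beta);    % constant (steady-state) term
B = [alpha -alpha; -beta beta]/(alpha+beta);  % term multiplied by (1-alpha-beta)^n
maxerr = 0;
for n = 0:20
    Pn = A + (1-alpha-beta)^n*B;              % closed form for P^n
    maxerr = max(maxerr, max(max(abs(Pn - P^n))));
end
```

For n = 0 the two terms sum to the identity matrix, which is a quick hand check of the signs in the transient term.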


We now examine three cases of interest. They are distinguished by the value that
$\lambda_2 = 1 - \alpha - \beta$ takes on. Clearly, as seen from (22.14) this is the factor that influences
the behavior of $\mathbf{P}^n$ with $n$. Since $\alpha$ and $\beta$ are both conditional probabilities we must
have that $0 \leq \alpha + \beta \leq 2$ and hence $-1 \leq \lambda_2 = 1 - \alpha - \beta \leq 1$. The cases are delineated
by whether this eigenvalue is strictly less than one in magnitude or not.


Case 1. $-1 < 1 - \alpha - \beta < 1$

Here $|1 - \alpha - \beta| < 1$ and therefore from (22.14) as $n \to \infty$

$$\mathbf{P}^n \to \begin{bmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \\ \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{bmatrix}.$$

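As a result,

$$\mathbf{p}^T[n] = \mathbf{p}^T[0]\mathbf{P}^n \to [\, p_0[0] \;\; p_1[0] \,] \begin{bmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \\ \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{bmatrix} = \left[\, \frac{\beta}{\alpha+\beta} \;\; \frac{\alpha}{\alpha+\beta} \,\right] \qquad (22.15)$$

for any $\mathbf{p}[0]$. Hence, the Markov chain approaches a steady-state regardless
of the initial state probabilities. It is said to be an ergodic Markov chain,
the reason for which we will discuss later. Also, the state probability vector
approaches the steady-state probability vector $\mathbf{p}^T[\infty]$, which is denoted by

$$\boldsymbol{\pi}^T = [\, \pi_0 \;\; \pi_1 \,] = \left[\, \frac{\beta}{\alpha+\beta} \;\; \frac{\alpha}{\alpha+\beta} \,\right]. \qquad (22.16)$$

Finally, note that each row of $\mathbf{P}^n$ becomes the same as $n \to \infty$.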
Case 2. $1 - \alpha - \beta = 1$ or $\alpha = \beta = 0$

If we draw the state probability diagram in this case, it should become clear
what will happen. This is shown in Figure 22.4a, where the zero transition
probability branches are omitted from the diagram. It is seen that there is no
chance of leaving the initial state so that we should have $\mathbf{p}[n] = \mathbf{p}[0]$ for all $n$.
To verify this, for $\alpha = \beta = 0$, the eigenvalues are both 1 and therefore $\boldsymbol{\Lambda} = \mathbf{I}$.
Hence, $\mathbf{P} = \mathbf{I}$ and $\mathbf{P}^n = \mathbf{I}$. Here the Markov chain also attains steady-state
and $\boldsymbol{\pi} = \mathbf{p}[0]$ but the steady-state PMF depends upon the initial probability
vector, unlike in Case 1. Note that the only possible realizations are 0000...
and 1111....


Figure 22.4: State probability diagrams for anomalous behaviors of two-state
Markov chain. (a) $\alpha = \beta = 0$. (b) $\alpha = \beta = 1$.


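Case 3. $1 - \alpha - \beta = -1$ or $\alpha = \beta = 1$

It is also easy to see what will happen in this case by referring to the state
probability diagram in Figure 22.4b. The outcomes must alternate and thus
the only realizations are 0101... and 1010..., with the realization generated
depending upon the initial state. Unlike the previous two cases, here there are
no steady-state probabilities as we now show. From (22.14) we have

$$\mathbf{P}^n = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{bmatrix} + (-1)^n \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{bmatrix} = \begin{cases} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} & \text{for } n \text{ even} \\[2ex] \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} & \text{for } n \text{ odd.} \end{cases}$$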

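Hence, the state probability vector is

$$\mathbf{p}^T[n] = \mathbf{p}^T[0]\mathbf{P}^n = [\, p_0[0] \;\; p_1[0] \,]\, \mathbf{P}^n = \begin{cases} [\, p_0[0] \;\; p_1[0] \,] & \text{for } n \text{ even} \\ [\, p_1[0] \;\; p_0[0] \,] & \text{for } n \text{ odd.} \end{cases}$$

As an example, if $\mathbf{p}^T[0] = [\, 1/4 \;\; 3/4 \,]$, then

$$\mathbf{p}^T[n] = \begin{cases} [\, \frac{1}{4} \;\; \frac{3}{4} \,] & \text{for } n \text{ even} \\ [\, \frac{3}{4} \;\; \frac{1}{4} \,] & \text{for } n \text{ odd} \end{cases}$$

as shown in Figure 22.5. It is seen that the state probabilities cycle between
two PMFs and hence there is no steady-state.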
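The cycling can be reproduced in a few lines. The MATLAB sketch below is illustrative only. It assumes the Case 3 matrix with α = β = 1 and the initial vector [1/4 3/4] from the example, and stores p^T[n] for n = 0, ..., 5 as the rows of a hypothetical array `pn`.

```matlab
% Sketch: state probabilities for alpha = beta = 1 (Case 3)
P = [0 1; 1 0];            % the chain must switch states at every step
p0 = [1/4 3/4];            % initial probability vector as a row, p^T[0]
pn = zeros(6,2);           % row n+1 holds p^T[n]
for n = 0:5
    pn(n+1,:) = p0*P^n;    % p^T[n] = p^T[0]*P^n
end
```

The rows of `pn` alternate between the two PMFs, so the sequence has no limit.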


Figure 22.5: Cycling of state probability vector for Case 3. (a) $p_0[n] = P[X[n] = 0]$. (b) $p_1[n] = P[X[n] = 1]$.


The last two cases are of little practical importance for a two-state Markov chain
since we usually have $0 < \alpha < 1$ and $0 < \beta < 1$. However, for a $K$-state Markov
chain it frequently occurs that some of the transition probabilities are zero (corresponding
to missing branches of the state probability diagram and an inability of the
Markov chain to transition between certain states). Then, the dependence upon the
initial state and cycling or periodic PMFs become quite important. The interested
reader should consult [Gallagher 1996] and [Cox and Miller 1965] for further details.
We next return to our golfing friend.

Example 22.5 - One-putting

Recall that our golfer had a transition probability matrix given by

$$\mathbf{P} = \begin{bmatrix} \frac{3}{4} & \frac{1}{4} \\ \frac{1}{2} & \frac{1}{2} \end{bmatrix}.$$

It is seen from (22.10) that $\alpha = 1/4$ and $\beta = 1/2$ and so this corresponds to Case
1 in which the same steady-state probability is reached regardless of the initial
probability vector. Hence, as $n \to \infty$, $\mathbf{P}^n$ will converge to a constant matrix and
therefore so will $\mathbf{p}[n]$. After many rounds of golf the probability of a one-putt or
of going to state 1 is found from the second element of the stationary probability
vector $\boldsymbol{\pi}$. This is from (22.16)

$$\boldsymbol{\pi}^T = [\, \pi_0 \;\; \pi_1 \,] = \left[\, \frac{\beta}{\alpha+\beta} \;\; \frac{\alpha}{\alpha+\beta} \,\right] = \left[\, \frac{1/2}{3/4} \;\; \frac{1/4}{3/4} \,\right] = \left[\, \frac{2}{3} \;\; \frac{1}{3} \,\right]$$


so that her probability of a one-putt is now only 1/3 as we surmised by examination
of Figure 22.1. At the first hole it was $p_1[0] = 1/2$. To determine how many holes
she must play until this steady-state probability is attained we let this be $n = n_{ss}$
and determine from (22.14) when $(1 - \alpha - \beta)^{n_{ss}} = (1/4)^{n_{ss}} \approx 0$. This is about
$n_{ss} = 10$ for which $(1/4)^{10} \approx 10^{-6}$. The actual state probability vector is shown in
Figure 22.6 using an initial state probability of $\mathbf{p}^T[0] = [\, 1/2 \;\; 1/2 \,]$. The steady-state
values of $\boldsymbol{\pi} = [\, 2/3 \;\; 1/3 \,]^T$ are also shown as dashed lines.
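The rate of convergence can be examined directly by iterating the state probability recursion. The following MATLAB sketch is an illustration under the example's values α = 1/4, β = 1/2 and p^T[0] = [1/2 1/2]. The names `pss` and `dev` are hypothetical, and the stopping point n = 10 is taken from the estimate above.

```matlab
% Sketch: convergence of p[n] for the golfer (alpha = 1/4, beta = 1/2)
alpha = 1/4; beta = 1/2;
P = [1-alpha alpha; beta 1-beta];  % transition probability matrix
p = [1/2 1/2];                     % p^T[0]
pss = [beta alpha]/(alpha+beta);   % stationary vector from (22.16)
nmax = 10;
dev = zeros(nmax,1);               % dev(n) is the largest deviation at step n
for n = 1:nmax
    p = p*P;                       % p^T[n] = p^T[n-1]*P
    dev(n) = max(abs(p - pss));    % distance from the steady-state values
end
```

Each step shrinks the deviation by the factor |1 − α − β| = 1/4, which is why about ten holes suffice.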


Figure 22.6: Convergence of state probability vector for Case 1 with $\alpha = 1/4$ and
$\beta = 1/2$. (a) $p_0[n] = P[X[n] = 0]$. (b) $p_1[n] = P[X[n] = 1]$.
22.5 Ergodic Markov Chains


We saw in the previous section that as $n \to \infty$, then for some $\mathbf{P}$ the state probability
vector approaches a steady-state value regardless of the initial state probabilities.
This was Case 1 for which each element of $\mathbf{P}$ was nonzero or

$$\mathbf{P} = \begin{bmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{bmatrix} > 0$$

where the “> 0” is meant to indicate that every element of $\mathbf{P}$ is greater than zero.
Equivalently, all the branches of the state probability diagram were present. A
Markov chain of this type is said to be ergodic in that a temporal average is equal
to an ensemble average as we will later show. The key requirement for this to be
true for any $K$-state Markov chain is that the $K \times K$ transition probability matrix
satisfies $\mathbf{P} > 0$. The matrix $\mathbf{P}$ then has some special properties. We already have
pointed out that the rows must sum to one; a matrix of this type is called a stochastic
matrix, and for ergodicity, we must have $\mathbf{P} > 0$; a matrix satisfying this requirement
is called an irreducible stochastic matrix. The associated Markov chain is known
as an ergodic or irreducible Markov chain. A theorem termed the Perron-Frobenius
theorem [Gallagher 1996] states that if $\mathbf{P} > 0$, then the transition probability matrix
will always have one eigenvalue equal to 1 and the remaining eigenvalues will have
magnitudes strictly less than 1. Such was the case for the two-state probability
transition matrix of Case 1 for which $\lambda_1 = 1$ and $|\lambda_2| = |1 - \alpha - \beta| < 1$. This
condition on $\mathbf{P}$ assures convergence of $\mathbf{P}^n$ to a constant matrix. Convergence may
also occur if some of the elements of $\mathbf{P}$ are zero but it is not guaranteed. A slightly
more general condition for convergence is that $\mathbf{P}^n > 0$ for some $n$ (not necessarily
$n = 1$). An example is


(see  Problem  22.13). 

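We now assume that $\mathbf{P} > 0$ and determine the steady-state probabilities for a
general $K$-state Markov chain. Since

$$\mathbf{p}^T[n] = \mathbf{p}^T[n-1]\mathbf{P}$$

and in steady-state we have that $\mathbf{p}^T[n-1] = \mathbf{p}^T[n] = \mathbf{p}^T[\infty]$, it follows that

$$\mathbf{p}^T[\infty] = \mathbf{p}^T[\infty]\mathbf{P}.$$

Letting the steady-state probability vector be $\boldsymbol{\pi} = \mathbf{p}[\infty]$, we have

$$\boldsymbol{\pi}^T = \boldsymbol{\pi}^T\mathbf{P} \qquad (22.17)$$

and we need only solve for $\boldsymbol{\pi}$. An example follows.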
Example 22.6 - Two-state Markov chain

We solve for the steady-state probability vector for Case 1. From (22.17) we have

$$[\, \pi_0 \;\; \pi_1 \,] = [\, \pi_0 \;\; \pi_1 \,] \begin{bmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{bmatrix}$$

so that

$$\pi_0 = (1 - \alpha)\pi_0 + \beta\pi_1$$
$$\pi_1 = \alpha\pi_0 + (1 - \beta)\pi_1$$

or

$$0 = -\alpha\pi_0 + \beta\pi_1$$
$$0 = \alpha\pi_0 - \beta\pi_1.$$

This yields $\pi_1 = (\alpha/\beta)\pi_0$ since the two linear equations are identical. Of course, we
also require that $\pi_0 + \pi_1 = 1$ and so this forms the second linear equation. The
solution then is

$$\pi_0 = \frac{\beta}{\alpha + \beta} \qquad \pi_1 = \frac{\alpha}{\alpha + \beta} \qquad (22.18)$$

and agrees with our previous results of (22.16).

◊
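The same two steps, dropping a redundant equation and adding the normalization constraint, carry over to a numerical solution. The MATLAB sketch below is one possible way to do this and is not the method of the text. It solves the transposed stationarity equations with the last one replaced by the requirement that the probabilities sum to one. The values of α and β and the names `A`, `b` and `pivec` are assumptions for illustration.

```matlab
% Sketch: solve pi^T = pi^T*P together with sum(pi) = 1
alpha = 0.25; beta = 0.5;           % illustrative Case 1 values
P = [1-alpha alpha; beta 1-beta];
K = size(P,1);
A = (P - eye(K))';                  % rows of A give the equations (P^T - I)*pi = 0
A(K,:) = ones(1,K);                 % replace the redundant last equation by sum(pi) = 1
b = [zeros(K-1,1); 1];
pivec = A\b;                        % stationary probability vector as a column
```

For these values the solution should agree with (22.18), that is, π0 = β/(α+β) and π1 = α/(α+β).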

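It can further be shown that if a steady-state probability vector exists (which will be
the case if $\mathbf{P} > 0$), then the solution for $\boldsymbol{\pi}$ is unique [Gallagher 1996]. Finally, note
that if we initialize the Markov chain with $\mathbf{p}[0] = \boldsymbol{\pi}$, then since $\mathbf{p}^T[1] = \mathbf{p}^T[0]\mathbf{P} =
\boldsymbol{\pi}^T\mathbf{P} = \boldsymbol{\pi}^T$, the state probability vector will be $\boldsymbol{\pi}$ for $n \geq 0$. The Markov chain
is then stationary since the state probability vector is the same for all $n$ and $\boldsymbol{\pi}$ is
therefore referred to as the stationary probability vector. We will henceforth use this
terminology for $\boldsymbol{\pi}$.

Another observation of importance is that if $\mathbf{P} > 0$, then $\mathbf{P}^n$ converges, and
it converges to $\mathbf{P}^\infty$, whose rows are identical. This was borne out in (22.15) and
is true in general (see Problem 22.17). (Note that this is not true for Case 2 in
which although $\mathbf{P}^n$ converges, it converges to $\mathbf{I}$, whose rows are not the same.) As
a result of this property, the steady-state value of the state probability vector does
not depend upon the initial probabilities since

$$\mathbf{p}^T[n] = \mathbf{p}^T[0]\mathbf{P}^n = [\, p_0[0] \;\; p_1[0] \,] \begin{bmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \\ \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{bmatrix} + \underbrace{\mathbf{p}^T[0](1 - \alpha - \beta)^n \begin{bmatrix} \frac{\alpha}{\alpha+\beta} & -\frac{\alpha}{\alpha+\beta} \\ -\frac{\beta}{\alpha+\beta} & \frac{\beta}{\alpha+\beta} \end{bmatrix}}_{\to\, \mathbf{0}^T \text{ as } n \to \infty} \to \left[\, \frac{\beta}{\alpha+\beta} \;\; \frac{\alpha}{\alpha+\beta} \,\right]$$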
independent of $\mathbf{p}^T[0]$. Also, as previously mentioned, if $\mathbf{P} > 0$, then as $n \to \infty$

$$\mathbf{P}^n \to \begin{bmatrix} \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \\ \frac{\beta}{\alpha+\beta} & \frac{\alpha}{\alpha+\beta} \end{bmatrix}$$

whose rows are identical. As a result, we have that

$$\lim_{n \to \infty} [\mathbf{P}^n]_{ij} = \pi_j \qquad j = 0, 1, \ldots, K - 1. \qquad (22.19)$$

Hence, the stationary probabilities may be obtained either by solving the set of
linear equations as was done for Example 22.6 or by examining a row of $\mathbf{P}^n$ as
$n \to \infty$. In Section 22.7 we give the general solution for the stationary probabilities.
We next give another example.

Example 22.7 - Machine failures

A machine is in operation at the beginning of day $n = 0$. It may break during
operation that day in which case repairs will begin at the beginning of the next
day ($n = 1$). In this case, the machine will not be in operation at the beginning
of day $n = 1$. There is a probability of 1/2 that the technician will be able to
repair the machine that day. If it is repaired, then the machine will be in operation
for day $n = 2$ and if not, the technician will again attempt to fix it the next day
($n = 2$). The probability that the machine will operate without a failure during the
day is 7/8. After many days of operation or failure what is the probability that the
machine will be working at the beginning of a day? Here there are two states, either
$X[n] = 0$ if the machine is not in operation at the beginning of day $n$, or $X[n] = 1$ if
the machine is in operation at the beginning of day $n$. The transition probabilities
are given as

$$P_{01} = P[\text{machine operational on day } n \mid \text{machine nonoperational on day } n - 1] = \frac{1}{2}$$
$$P_{11} = P[\text{machine operational on day } n \mid \text{machine operational on day } n - 1] = \frac{7}{8}$$

and so the state transition probability matrix is

$$\mathbf{P} = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{8} & \frac{7}{8} \end{bmatrix}$$

noting that $P_{00} = 1 - P_{01} = 1/2$ and $P_{10} = 1 - P_{11} = 1/8$. This Markov chain is
shown in Figure 22.7. Since $\mathbf{P} > 0$, a steady-state is reached and, with $\alpha = P_{01} = 1/2$
and $\beta = P_{10} = 1/8$, the stationary probabilities are from (22.18)

$$\pi_0 = \frac{\beta}{\alpha + \beta} = \frac{1/8}{1/2 + 1/8} = \frac{1}{5}$$
$$\pi_1 = \frac{\alpha}{\alpha + \beta} = \frac{1/2}{1/2 + 1/8} = \frac{4}{5}.$$
Figure 22.7: State probability diagram for Example 22.7. State 0: machine nonoperational at beginning of day. State 1: machine operational at beginning of day.


The machine will be in operation at the beginning of a day with a probability of
0.8.

◊
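As noted in connection with (22.19), the stationary probabilities can also be read from any row of a large power of the transition probability matrix. The MATLAB sketch below illustrates this for the machine example. The exponent 50 is an arbitrary choice of "large" and the variable names are hypothetical.

```matlab
% Sketch: rows of P^n approach the stationary probabilities, as in (22.19)
P = [1/2 1/2; 1/8 7/8];                  % machine example: state 0 = down, state 1 = up
Pn = P^50;                               % a large power of P (n = 50 chosen arbitrarily)
rowdiff = max(abs(Pn(1,:) - Pn(2,:)));   % the two rows should nearly coincide
pi1 = Pn(1,2);                           % approximate stationary probability of state 1
```

Because the second eigenvalue is 1 − 1/2 − 1/8 = 3/8, the rows agree to machine precision well before n = 50.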

Note that in the last example the states of 0 and 1 are arbitrary labels. They could
just as well have been “nonoperational” and “operational”. In problems such as
these the state description is chosen to represent meaningful attributes of interest.
One last comment concerns our apparent preoccupation with the steady-state
behavior of a Markov chain. Although not always true, we are many times only
interested in this because the choice of a starting time, i.e., at $n = 0$, is not easy
to specify. In the previous example, it is conceivable that the machine in question
has been in operation for a long time and it is only recently that a plant manager
has become interested in its failure rate. Therefore, its initial starting time was
probably some time in the past and we are now observing the states for some large
$n$. We continue our discussion of steady-state characteristics in the next section.


22.6  Further  Steady-State  Characteristics 

22.6.1  State  Occupation  Time 

It is frequently of interest to be able to determine the percentage of time that a
Markov chain is in a particular state, also called the state occupation time. Such
was the case in Example 22.7, although a careful examination reveals that what we
actually computed was the probability of being operational at the beginning of each
day. In essence we are now asking for the relative frequency (or percentage of time)
of the machine being operational. This is much the same as asking for the relative
frequency of heads in a long sequence of independent fair coin tosses. We have
proven by the law of large numbers (see Chapter 15) that this relative frequency
must approach a probability of 1/2 as the number of coin tosses approaches infinity.
For Markov chains the trials are not independent and so the law of large numbers
does not apply directly. However, as we now show, if steady-state is attained, then
the fraction of time the Markov chain spends in a particular state approaches the
steady-state probability. This allows us to say that the fraction of time that the
Markov chain spends in state $j$ is just $\pi_j$.

Again consider a two-state Markov chain with states 0 and 1 and assume that
$\mathbf{P} > 0$. We wish to determine the fraction of time spent in state 1. For some large
$n$ this is given by

$$\frac{1}{N} \sum_{j=n}^{n+N-1} X[j]$$

which is recognized as the sample mean of the $N$ state outcomes for $\{X[n], X[n+1], \ldots, X[n+N-1]\}$.
We first determine the expected value as

$$E\left[ \frac{1}{N} \sum_{j=n}^{n+N-1} X[j] \right] = E_{X[0]}\left[ E\left[ \frac{1}{N} \sum_{j=n}^{n+N-1} X[j] \,\Big|\, X[0] = i \right] \right] = E_{X[0]}\left[ \frac{1}{N} \sum_{j=n}^{n+N-1} E\big[ X[j] \,\big|\, X[0] = i \big] \right]. \qquad (22.20)$$

But

$$E\big[ X[j] \,\big|\, X[0] = i \big] = P\big[ X[j] = 1 \,\big|\, X[0] = i \big] = [\mathbf{P}^j]_{i1} \to \pi_1$$

as $j \geq n \to \infty$, which follows from (22.19). The expected value does not depend
upon the initial state $i$. Therefore, we have from (22.20) that

$$E\left[ \frac{1}{N} \sum_{j=n}^{n+N-1} X[j] \right] \to \frac{1}{N} \sum_{j=n}^{n+N-1} \pi_1 = \pi_1.$$

Thus, as $n \to \infty$, the expected fraction of time in state 1 is $\pi_1$. Furthermore, although
it is more difficult to show, the variance of the sample mean converges to zero as
$N \to \infty$ so that the fraction of time (and not just the expected value) spent in state
1 will converge to $\pi_1$ or

$$\frac{1}{N} \sum_{j=n}^{n+N-1} X[j] \to \pi_1. \qquad (22.21)$$

This is the same result as for the repeated independent tossing of a fair coin. The
result stated in (22.21) is that the temporal mean is equal to the ensemble mean, which
says that for large $n$, i.e., in steady-state, $\frac{1}{N} \sum_{j=n}^{n+N-1} X[j] \to \pi_1$ as $N \to \infty$. This
is the property of ergodicity as previously described in Chapter 17. Thus, a Markov
chain that achieves a steady-state regardless of the initial state probabilities is
called an ergodic Markov chain.

Returning to our golfing friend, we had previously questioned the fraction of
the time she will achieve one-putts. We know that her stationary probability is
$\pi_1 = 1/3$. Thus, after playing many rounds of golf, she will be one-putting about
1/3 of the time.
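The convergence of the temporal average can be observed in a simulation. The following MATLAB sketch is an illustration only. It generates a long realization of the golfer's two-state chain (α = 1/4, β = 1/2) and computes the fraction of holes spent in state 1. The realization length, the starting state and the variable names are arbitrary assumptions, and the generator is seeded with the legacy `rand('state',0)` call used by the programs in this chapter.

```matlab
% Sketch: fraction of time in state 1 for the golfer's chain
rand('state',0)                 % fix the random number generator (legacy syntax)
alpha = 1/4; beta = 1/2;
P = [1-alpha alpha; beta 1-beta];
N = 100000;                     % number of holes simulated (arbitrary, large)
x = zeros(N,1);
state = 0;                      % start in state 0; the choice should not matter
for n = 1:N
    % go to state 1 with probability P(state+1,2), otherwise to state 0
    state = double(rand(1,1) < P(state+1,2));
    x(n) = state;
end
frac1 = mean(x);                % temporal average, expected to be near 1/3
```

Repeating the run with `state = 1` as the starting value should give nearly the same fraction, which is the practical meaning of ergodicity here.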


22.6.2  Mean  Recurrence  Time 


Another  property  of  the  ergodic  Markov  chain  that  is  of  interest  is  the  average 
number  of  steps  before  a  state  is  revisited.  For  example,  the  golfer  may  wish  to 
know  the  average  number  of  holes  she  will  have  to  play  before  another  one-putt 
occurs,  given  that  she  has  just  one-putted.  This  is  equivalent  to  determining  the 
average  number  of  steps  the  Markov  chain  will  undergo  before  it  returns  to  state 
1.  The  time  between  visits  to  the  same  state  is  called  the  recurrence  time  and  the 
average  of  this  is  called  the  mean  recurrence  time.  We  next  determine  this  average. 

Let $T_R$ denote the recurrence time and note that it is an integer random variable
that can take on values in the sample space $\{1, 2, \ldots\}$. For the two-state Markov
chain shown in Figure 22.3 we first assume that we are in state 1 at time $n = n_0$.
Then, the value of the recurrence time will be 1, or 2, or 3, etc. if $X[n_0 + 1] = 1$,
or $X[n_0 + 1] = 0, X[n_0 + 2] = 1$, or $X[n_0 + 1] = 0, X[n_0 + 2] = 0, X[n_0 + 3] = 1$,
etc., respectively. The probabilities of these events are $1 - \beta$, $\beta\alpha$, and $\beta(1 - \alpha)\alpha$,
respectively, as can be seen by referring to Figure 22.3. In general, the PMF is given
as

$$P[T_R = k \mid \text{initially in state 1}] = \begin{cases} 1 - \beta & k = 1 \\ \beta\alpha(1 - \alpha)^{k-2} & k \geq 2 \end{cases}$$

which is a geometric-type PMF (see Chapter 5). To find the mean recurrence time
we need only determine the expected value of $T_R$. This is

$$E[T_R \mid \text{initially in state 1}] = (1 - \beta) + \sum_{k=2}^{\infty} k\, \beta\alpha(1 - \alpha)^{k-2}$$

$$= (1 - \beta) + \alpha\beta \sum_{l=1}^{\infty} (l + 1)(1 - \alpha)^{l-1} \qquad (\text{let } l = k - 1)$$

$$= (1 - \beta) + \alpha\beta \sum_{l=1}^{\infty} (1 - \alpha)^{l-1} + \beta \sum_{l=1}^{\infty} l \underbrace{\alpha(1 - \alpha)^{l-1}}_{\text{geom}(\alpha) \text{ PMF}}$$

$$= (1 - \beta) + \alpha\beta \frac{1}{1 - (1 - \alpha)} + \beta \frac{1}{\alpha} \qquad (\text{from Section 6.4.3})$$

$$= \frac{\alpha + \beta}{\alpha}$$
so that we have finally

$$E[T_R \mid \text{initially in state 1}] = \frac{1}{\pi_1}.$$

It is seen that the mean recurrence time is the reciprocal of the stationary state probability.
This is much the same result as for a geometric PMF and is interpreted as
the number of failures (not returning to state 1) before a success (returning to state
1). For our golfer, since she has a stationary probability of one-putting of 1/3, she
must wait on the average $1/(1/3) = 3$ holes between one-putts. This agrees with our
simulation results shown in Figure 22.1.
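The mean recurrence time can also be estimated from a simulated realization by recording the number of steps between successive visits to state 1. The MATLAB sketch below is illustrative and uses the golfer's values α = 1/4 and β = 1/2. The realization length and the variable names (`Tr`, `last`, `cnt`) are assumptions, and the chain is started in state 1 to match the conditioning in the derivation.

```matlab
% Sketch: estimate the mean recurrence time of state 1
rand('state',0)                 % fix the random number generator (legacy syntax)
alpha = 1/4; beta = 1/2;
P = [1-alpha alpha; beta 1-beta];
N = 100000;                     % number of steps simulated (arbitrary, large)
state = 1;                      % initially in state 1 (a one-putt has just occurred)
last = 0;                       % time index of the most recent visit to state 1
Tr = zeros(N,1);                % storage for the observed recurrence times
cnt = 0;                        % number of recurrence times recorded
for n = 1:N
    state = double(rand(1,1) < P(state+1,2));   % next state
    if state == 1
        cnt = cnt + 1;
        Tr(cnt) = n - last;     % steps since the previous visit to state 1
        last = n;
    end
end
meanTr = mean(Tr(1:cnt));       % expected to be near 1/pi_1 = 3
```

The fraction of recurrence times equal to 1, `mean(Tr(1:cnt) == 1)`, can be compared with 1 − β = 1/2 as a further check on the PMF.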


22.7 K-State Markov Chains


Markov chains with more than two states are quite common and useful in practice
but their analysis can be difficult. Most of the previous properties of a Markov
chain apply to any finite number $K$ of states. Computation of the $n$-step transition
probability matrix is of course more difficult and requires computer evaluation. Most
importantly, however, steady-state is still attained if $\mathbf{P} > 0$. The solution
for the stationary probabilities is given next. It is derived in Appendix 22A.

The stationary probability vector for a $K$-state Markov chain is $\boldsymbol{\pi}^T = [\, \pi_0 \;\; \pi_1 \; \ldots \; \pi_{K-1} \,]$.
Its solution is given as

$$\boldsymbol{\pi} = (\mathbf{I} - \mathbf{P}^T + \mathbf{1}\mathbf{1}^T)^{-1}\mathbf{1} \qquad (22.22)$$

where $\mathbf{I}$ is the $K \times K$ identity matrix and $\mathbf{1} = [\, 1 \; 1 \; \ldots \; 1 \,]^T$, which is a $K \times 1$ vector
of ones. We next give an example of a 3-state Markov chain.

Example 22.8 - Weather modeling

Assume that the weather for each day can be classified as being either rainy (state
0), cloudy (state 1), or sunny (state 2). We wish to determine in the long run
(steady-state) the percentage of sunny days. From the discussion in Section 22.6.1
this is the state occupation time, and is equal to the stationary probability $\pi_2$. To
do so we assume the conditional probabilities

$$\begin{array}{llll} \text{currently raining (state 0):} & P_{00} = \frac{4}{8}, & P_{01} = \frac{3}{8}, & P_{02} = \frac{1}{8} \\[1ex] \text{currently cloudy (state 1):} & P_{10} = \frac{3}{8}, & P_{11} = \frac{2}{8}, & P_{12} = \frac{3}{8} \\[1ex] \text{currently sunny (state 2):} & P_{20} = \frac{1}{8}, & P_{21} = \frac{3}{8}, & P_{22} = \frac{4}{8}. \end{array}$$

This says that if it is currently raining, then it is most probable that the next day
will also have rain (4/8). The next most probable weather condition will be cloudy
for the next day (3/8), and the least probable weather condition is sunny for the
next day (1/8). See if you can rationalize the other entries in $\mathbf{P}$. The complete state
transition probability matrix is

$$\mathbf{P} = \begin{bmatrix} \frac{4}{8} & \frac{3}{8} & \frac{1}{8} \\[0.5ex] \frac{3}{8} & \frac{2}{8} & \frac{3}{8} \\[0.5ex] \frac{1}{8} & \frac{3}{8} & \frac{4}{8} \end{bmatrix}$$
and the state probability diagram is shown in Figure 22.8.

Figure 22.8: Three-state probability diagram for weather example (0: rainy, 1: cloudy, 2: sunny).

We can use this to determine the probability of the weather conditions on any day if we know the
weather on day $n = 0$. For example, to find the probability of the weather on
Saturday knowing that it is raining on Monday, we use

$$\mathbf{p}^T[n] = \mathbf{p}^T[0]\mathbf{P}^n$$

with $n = 5$ and $\mathbf{p}^T[0] = [\, 1 \;\; 0 \;\; 0 \,]$. Using a computer to evaluate this we have that

$$\mathbf{p}[5] = \begin{bmatrix} 0.3370 \\ 0.3333 \\ 0.3296 \end{bmatrix}$$

and it appears that the possible weather conditions are nearly equiprobable. To find
the stationary probabilities for the weather conditions we must solve $\boldsymbol{\pi}^T = \boldsymbol{\pi}^T\mathbf{P}$.
Using the solution of (22.22), we find that

$$\boldsymbol{\pi} = \begin{bmatrix} \frac{1}{3} \\[0.5ex] \frac{1}{3} \\[0.5ex] \frac{1}{3} \end{bmatrix}.$$
As $n \to \infty$, it is equiprobable that the weather will be rainy, cloudy, or sunny.
Furthermore, because of ergodicity the fraction of days that it will be rainy, or be
cloudy, or be sunny will all be 1/3.

◊

The previous result that the stationary probabilities are equal is true in general for
the type of transition probability matrix given. Note that $\mathbf{P}$ not only has all its rows
summing to one but also its column entries sum to one for all the columns. This is
called a doubly stochastic matrix and always results in equal stationary probabilities
(see Problem 22.27).
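The two computer evaluations in the weather example take only a few lines. The MATLAB sketch below is an illustration, not the program used for the text. It assumes that (22.22) reads π = (I − P^T + 11^T)^{-1} 1, solves that linear system with the backslash operator rather than forming the inverse explicitly, and evaluates p[5] for an initial state of rain. All variable names are hypothetical.

```matlab
% Sketch: stationary probabilities from (22.22) and p[5] for the weather chain
P = [4/8 3/8 1/8; 3/8 2/8 3/8; 1/8 3/8 4/8];   % weather transition probability matrix
K = size(P,1);
one = ones(K,1);                        % K x 1 vector of ones
pivec = (eye(K) - P' + one*one')\one;   % solves (I - P^T + 1*1^T)*pi = 1
p5 = ([1 0 0]*P^5)';                    % p[5] when it is raining on day n = 0
colsums = sum(P,1);                     % all ones for a doubly stochastic matrix
```

The form of (22.22) can be checked by substitution: since π = P^T π and 1^T π = 1, the product (I − P^T + 11^T)π reduces to the vector of ones.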


22.8  Computer  Simulation 

The computer simulation of a Markov chain is very simple. Consider the weather
example of the previous section. We first need to generate a realization of a random
variable taking on the values 0, 1, 2 with the PMF $p_0[0], p_1[0], p_2[0]$. This can be
done using the approach of Section 5.9. Once the realization has been obtained,
say $x[0] = i$, we continue the same procedure but must choose the next PMF, which
is actually a conditional PMF. If $x[0] = i = 1$ for example, then we use the PMF
$p[0|1] = P_{10}$, $p[1|1] = P_{11}$, $p[2|1] = P_{12}$, which are just the entries in the second row
of $\mathbf{P}$. We continue this procedure for all $n \geq 1$. Some MATLAB code to generate a
realization for the weather example is given below.


clear all
rand('state',0)
N=1000; % set number of samples desired
p0=[1/3 1/3 1/3]'; % set initial probability vector
P=[4/8 3/8 1/8;3/8 2/8 3/8;1/8 3/8 4/8]; % set transition prob. matrix
xi=[0 1 2]'; % set values of PMF
X0=PMFdata(1,xi,p0); % generate X[0] (see Appendix 6B for PMFdata.m
                     % function subprogram)
i=X0+1; % choose appropriate row for PMF
X(1,1)=PMFdata(1,xi,P(i,:)); % generate X[1]
i=X(1,1)+1; % choose appropriate row for PMF
for n=2:N % generate X[n]
   i=X(n-1,1)+1; % choose appropriate row for PMF
   X(n,1)=PMFdata(1,xi,P(i,:));
end

The  reader  may  wish  to  modify  and  run  this  program  to  gain  some  insight  into  the 
effect  of  the  conditional  probabilities  on  the  predicted  weather  patterns. 
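For readers without the `PMFdata.m` subprogram, the same procedure can be written with built-in functions only. The MATLAB sketch below is one possible self-contained variant and not the program of the text. It draws each state by comparing a uniform random number with the cumulative sum of the current PMF, which is an assumption about how one might implement the approach of Section 5.9. The array layout and the names `pmf` and `frac` are illustrative.

```matlab
% Sketch: weather chain simulation without PMFdata.m
rand('state',0)                      % fix the random number generator (legacy syntax)
N = 1000;                            % last time index to simulate
p0 = [1/3 1/3 1/3];                  % initial probability vector as a row
P = [4/8 3/8 1/8; 3/8 2/8 3/8; 1/8 3/8 4/8];   % transition probability matrix
X = zeros(N+1,1);                    % X(n+1) holds the state at time n
pmf = p0;                            % PMF used for the next draw
for n = 0:N
    u = rand(1,1);
    X(n+1) = sum(u > cumsum(pmf));   % inverse-CDF draw giving 0, 1, or 2
    X(n+1) = min(X(n+1), 2);         % guard against round-off in cumsum
    pmf = P(X(n+1)+1,:);             % conditional PMF: row selected by current state
end
frac = [mean(X==0) mean(X==1) mean(X==2)];   % fraction of days in each state
```

With a realization this short the entries of `frac` are only roughly equal to 1/3; increasing `N` should tighten the agreement.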
22.9 Real-World Example - Strange Markov Chain Dynamics

It is probably fitting that as the last real-world example, we choose one that questions
what the real-world actually is. Is it a place of determinism, however complex,
or one that is subject to the whims of chance events? Random, as defined by Webster's
dictionary, means “lacking a definite plan, purpose, or pattern”. Is this a valid
definition? We do not plan to answer this question, but only to present some “food
for thought”. The seemingly random Markov chain provides an interesting example.

Consider a square arrangement of $101 \times 101$ points and define a set of states
as the locations of the integer points within this square. The points are therefore
denoted by the integer coordinates $(i, j)$, where $i = 0, 1, \ldots, 100$; $j = 0, 1, \ldots, 100$.
The number of states is $K = 101^2$. Next define a Markov chain for this set of states
such that the $n$th outcome is a realization of the random point $\mathbf{X}[n] = [\, I[n] \;\; J[n] \,]^T$,
where $I[n]$ and $J[n]$ are random variables taking on integer values in the interval
$[0, 100]$. The initial point is chosen to be $\mathbf{X}[0] = [\, 10 \;\; 80 \,]^T$ and succeeding points
evolve according to the random process:

1. Choose at random one of the reference points (0, 0), (100, 0), (50, 100).

2. Find the midpoint between the initial point and the chosen reference point and round it to the nearest integer coordinates (so that it becomes a state output).

3. Replace the initial point with the one found in step 2.

4. Go to step 1 and repeat the process, always using the previous point and one of the reference points chosen at random.

This procedure is equivalent to the formula

$$\mathbf{X}[n] = \left[ \tfrac{1}{2}\left( \mathbf{X}[n-1] + \mathbf{R}[n] \right) \right]_{\text{round}} \qquad n \geq 1 \qquad (22.23)$$

where $\mathbf{R}[n] = [\, R_1[n] \;\; R_2[n] \,]^T$ is the reference point chosen at random and $[\cdot]_{\text{round}}$
denotes rounding of both elements of the vector to the nearest integer. Note that
this is a Markov chain. The points generated must all lie within the square at integer
coordinates due to the averaging and rounding that is ongoing. Also, the current
output depends only upon the previous output $\mathbf{X}[n-1]$, which justifies the claim
of a Markov chain. The process is “random” due to our choice of $\mathbf{R}[n]$ from the
sample space $\{(0, 0), (100, 0), (50, 100)\}$ with equal probabilities.

The behavior of this Markov chain is shown in Figure 22.9, where the successive
output points have been plotted with the first few shown with their values of $n$.
It appears that the chain attains a steady-state and its steady-state PMF is zero
over many triangular regions. It is interesting to note that the pattern consists of
3 triangles: one with vertices (0, 0), (50, 0), (25, 50), and the others with vertices
(50, 0), (100, 0), (75, 50), and (25, 50), (75, 50), (50, 100). Within each of these triangles
resides an exact replica of the whole pattern and within each replica resides
another replica, etc.! Such a figure is called a fractal with this particular one termed
a Sierpinski triangle.

Figure 22.9: Steady-state Markov chain.

The MATLAB code used to produce this figure is given below.

% sierpinski.m
%
clear all
rand('state',0)
r(:,1)=[0 0]'; % set up reference points
r(:,2)=[100 0]';
r(:,3)=[50 100]';
x0=[10 80]'; % set initial state
plot(x0(1),x0(2),'.') % plot state outcome as point
axis([0 100 0 100])
hold on
xn_1=x0;
for n=1:10000 % generate states
   j=floor(3*rand(1,1)+1); % choose at random one of three
                           % reference points
   xn=round(0.5*(r(:,j)+xn_1)); % generate new state
   plot(xn(1),xn(2),'.') % plot state outcome as point
   xn_1=xn; % make current state the previous one for
            % next transition
end
grid
hold off

The  question  arises  as  to  whether  the  Markov  chain  is  deterministic  or  random. 
We  choose  not  to  answer  this  question  (because  we  don’t  know  the  answer!).  Instead 
we  refer  the  interested  reader  to  the  excellent  book  [Peitgen,  Jurgens,  and  Saupe 
1992]  and  also  the  popular  layman’s  account  [Gleick  1987]  for  further  details.  As 
a  more  practical  application,  it  is  observed  that  seemingly  complex  figures  can  be 
generated  using  a  simple  algorithm.  This  leads  to  the  idea  of  data  compression  in 
which  the  only  information  needed  to  store  a  complex  figure  is  the  details  of  the 
algorithm. A field of sunflowers is one such example; the reader should
consult [Barnsley 1988] for how this is done.

Problems

22.1 (w) A Markov chain has the states “A” and “B” or equivalently 0 and 1. If
the conditional probabilities are P[A|B] = 0.1 and P[B|A] = 0.4, draw the
state probability diagram. Also, find the transition probability matrix.

22.2 (^) (f) For the state probability diagram shown in Figure 22.2 find the
probability of obtaining the outcomes X[n] = 0, 1, 0, 1, 1 for n = 0, 1, 2, 3, 4,
respectively.

22.3 (f) For the state probability diagram shown in Figure 22.3 find the
probabilities of the outcomes X[n] = 0, 1, 0, 1, 1, 1 for n = 0, 1, 2, 3, 4, 5, respectively
and also for X[n] = 1, 1, 0, 1, 1, 1 for n = 0, 1, 2, 3, 4, 5, respectively. Compare
the two and explain the difference.

22.4 (w) In some communication systems it is important to determine the percentage
of time a person is talking. From measurements it is found that if a person
is talking in a given time interval, then he will be talking in the next time
interval with a probability of 0.75. If he is not talking in a time interval, then
he will be talking in the next time interval with a probability of 0.5. Draw the
state probability diagram using the states “talking” and “not talking”.

22.5 (^) (t) In this problem we give an example of a random process that does
not have the Markov property. The random process is defined as an exclusive
OR logical function. This is Y[n] = X[n] ⊕ X[n − 1] for n > 0, where X[n]
for n ≥ 0 takes on values 0 and 1 with probabilities 1 − p and p, respectively.
The X[n]’s are IID. Also, for n = 0 we define Y[0] = X[0]. The definition
of this operation is that Y[n] = 0 only if X[n] and X[n − 1] are the same
(both equal to 0 or both equal to 1), and otherwise Y[n] = 1. Determine
P[Y[2] = 1|Y[1] = 1, Y[0] = 0] and P[Y[2] = 1|Y[1] = 1] to show that they
are not equal in general.

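22.6 (f) For the transition probability matrix given below draw the corresponding
state probability diagram.

\[
\mathbf{P} = \begin{bmatrix}
\frac{1}{2} & \frac{1}{4} & \frac{1}{4} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{2}{3} & \frac{1}{6} & \frac{1}{6}
\end{bmatrix}
\]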

22.7 (w) A fair die is tossed many times in succession. The tosses are independent
of each other. Let X[n] denote the maximum of the first n + 1 tosses. Determine
the transition probability matrix. Hint: The maximum value cannot
decrease as n increases.

22.8 (w) A particle moves along the circle shown in Figure 22.10 from one point
to the other in a clockwise (CW) or counterclockwise (CCW) direction. At
each step it can move either CW 1 unit or CCW 1 unit. The probabilities
are P[CCW] = p and P[CW] = 1 − p and do not depend upon the current
location of the particle. For the states 0, 1, 2, 3 find the transition probability
matrix.

Figure 22.10: Movement of particle along a circle for Problem 22.8.

22.9  (^)  (w,c)  A  digital  communication  system  transmits  a  0  or  a  1.  After  10 
miles  of  cable  a  repeater  decodes  the  bit  and  declares  it  either  a  0  or  a  1.  The 
probability  of  a  decoding  error  is  0.1  as  shown  schematically  in  Figure  22.11. 
It  is  then  retransmitted  to  the  next  repeater  located  10  miles  away.  If  the 
repeaters  are  all  located  10  miles  apart  and  the  communication  system  is  50 
miles  in  length,  find  the  probability  of  an  error  if  a  0  is  initially  transmitted. 
Hint:  You  will  need  a  computer  to  work  this  problem. 


Figure  22.11:  One  section  of  a  communication  link. 

22.10 (w,c) If α = β = 1/4 for the state probability diagram shown in Figure 22.3,
determine n so that the Markov chain is in steady-state. Hint: You will need
a computer to work this problem.

22.11 (^) (w) There are two urns filled with red and black balls. Urn 1 has 60%
red balls and 40% black balls while urn 2 has 20% red balls and 80% black
balls. A ball is drawn from urn 1, its color noted, and then replaced. If it is
red, the next ball is also drawn from urn 1, its color noted and then replaced.
If the ball is black, then the next ball is drawn from urn 2, its color noted
and then replaced. This procedure is continued indefinitely. Each time a ball
is drawn the next ball is drawn from urn 1 if the ball is red and from urn 2
if it is black. After many trials of this experiment what is the probability of
drawing a red ball? Hint: Define the states 1 and 2 as urns 1 and 2 chosen.
Also, note that P[red drawn] = P[red drawn|urn 1 chosen]P[urn 1 chosen] +
P[red drawn|urn 2 chosen]P[urn 2 chosen].

22.12 (^) (w) A contestant answers questions posed to him by a game show
host. If his answer is correct, the game show host gives him a harder question
for which his probability of answering correctly is 0.01. If, however, his answer
is incorrect, the contestant is given an easy question for which his probability
of answering correctly is 0.99. After answering many questions, what is the
probability of answering a question correctly?

22.13  (f)  For  the  transition  probability  matrix 

'  h  h  0  ' 

p=  i  !  i 

1  0  1 

L  2  u  2  j 

will $\mathbf{P}^n$ converge as $n \to \infty$? You should be able to answer this question
without the use of a computer. Hint: Determine $\mathbf{P}^2$.

22.14  (o)  (w,c)  For  the  transition  probability  matrix 

5  5  0  0‘ 

I  I  0  0 

I  I  I  I 

4  4  4  4 

1111 
4  4  4  4  . 

does the Markov chain attain steady-state? If it does, what are the steady-state
probabilities? Hint: You will need a computer to evaluate the answer.

22.15 (w,c) There are three lightbulbs that are always on in a room. At the beginning
of each day the custodian checks to see if at least one lightbulb is working.
If all three lightbulbs have failed, then he will replace them all. During the day
each lightbulb will fail with a probability of 1/2 and the failure is independent
of the other lightbulbs failing. Letting the state be the number of working
lightbulbs draw the state probability diagram and determine the transition
probability matrix. Show that eventually all three bulbs must fail and the
custodian will then have to replace them. Hint: You will need a computer to
work this problem.

22.16 (f) Find the stationary probabilities for the transition probability matrix

22.17 (t) In this problem we discuss the proof of the property that if $\mathbf{P} > 0$, the
rows of $\mathbf{P}^n$ will all converge to the same values and that these values are the
stationary probabilities. We consider the case of K = 3 for simplicity and
assume distinct eigenvalues. Then, it is known from the Perron-Frobenius
theorem that we will have the eigenvalues $\lambda_1 = 1$, $|\lambda_2| < 1$, and $|\lambda_3| < 1$.
From (22.12) we have that $\mathbf{P}^n = \mathbf{V}\mathbf{\Lambda}^n\mathbf{V}^{-1}$ which for K = 3 is

\[
\mathbf{P}^n = \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \mathbf{v}_3 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \lambda_2^n & 0 \\ 0 & 0 & \lambda_3^n \end{bmatrix}
\begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \\ \mathbf{w}_3^T \end{bmatrix}
\]

where $\mathbf{W} = \mathbf{V}^{-1}$ and $\mathbf{w}_i^T$ is the ith row of $\mathbf{W}$. Next argue that as $n \to \infty$,
$\mathbf{P}^n \to \mathbf{v}_1\mathbf{w}_1^T$. Use the relation $\mathbf{P}^\infty\mathbf{1} = \mathbf{1}$ (why?) to show that $\mathbf{v}_1 = c\mathbf{1}$, where
c is a constant. Next use $\boldsymbol{\pi}^T\mathbf{P}^\infty = \boldsymbol{\pi}^T$ (why?) to show that $\mathbf{w}_1 = d\boldsymbol{\pi}$, where
d is a constant. Finally, use the fact that $\mathbf{w}_1^T\mathbf{v}_1 = 1$ since $\mathbf{W}\mathbf{V} = \mathbf{I}$ to show
that cd = 1 and therefore, $\mathbf{P}^\infty = \mathbf{1}\boldsymbol{\pi}^T$. The latter is the desired result which
can be verified by direct multiplication of $\mathbf{1}$ by $\boldsymbol{\pi}^T$.

22.18 (f,c) For the transition probability matrix

\[
\mathbf{P} = \begin{bmatrix}
0.1 & 0.4 & 0.5 \\
0.2 & 0.5 & 0.3 \\
0.3 & 0.3 & 0.4
\end{bmatrix}
\]

find $\mathbf{P}^{100}$ using a computer evaluation. Does the form of $\mathbf{P}^{100}$ agree with the
theory?

22.19 (^) (f,c) Using the explicit solution for the stationary probability vector
given by (22.22), determine its value for the transition probability matrix given
in Problem 22.18. Hint: You will need a computer to evaluate the solution.


22.20 (w) The result of multiplying two identical matrices together produces the
same matrix as shown below.

Explain what this means for Markov chains.

22.21 (f) For the transition probability matrix

\[
\mathbf{P} = \begin{bmatrix}
0.99 & 0.01 \\
0.01 & 0.99
\end{bmatrix}
\]

solve for the stationary probabilities. Compare your probabilities to those
obtained if a fair coin is tossed repeatedly and the tosses are independent.
Do you expect the realization for this Markov chain to be similar to that of
the fair coin tossing?

22.22 (c) Simulate on the computer the Markov chain described in Problem 22.21.
Use $\mathbf{p}^T[0] = [\,1/2 \;\; 1/2\,]$ for the initial probability vector. Generate a realization
for n = 0, 1, ..., 99 and plot the results. What do you notice about the
realization? Next generate a realization for n = 0, 1, ..., 9999 and estimate the
stationary probability of observing 1 by taking the sample mean of the
realization. Do you obtain the theoretical result found in Problem 22.21? (Recall
that this type of Markov chain is ergodic and so a temporal average is equal
to an ensemble average.)

22.23  (w)  A  person  is  late  for  work  on  his  first  day  with  a  probability  of  0.1.  On 
succeeding  days  he  is  late  for  work  with  a  probability  of  0.2  if  he  was  late  the 
previous  day  and  with  a  probability  of  0.4  if  he  was  on  time  the  previous  day. 
In  the  long  run  what  percentage  of  time  is  he  late  to  work? 

22.24 (o) (f,c) Assume for the weather example of Example 22.8 that the
transition probability matrix is

\[
\mathbf{P} = \begin{bmatrix}
\frac{6}{8} & \frac{1}{8} & \frac{1}{8} \\
\frac{5}{8} & \frac{2}{8} & \frac{1}{8} \\
\frac{4}{8} & \frac{3}{8} & \frac{1}{8}
\end{bmatrix}
\]

What is the steady-state probability of rain? Compare your answer to that
obtained in Example 22.8 and explain the difference. Hint: You will need a
computer to find the solution.

22.25 (w,c) Three machines operate together on a manufacturing floor, and each
day there is a possibility that any of the machines may fail. The probability of
their failure depends upon how many other machines are still in operation. The
number of machines in operation at the beginning of each day is represented by
the state values of 0, 1, 2, 3 and the corresponding state transition probability
matrix is

\[
\mathbf{P} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0.5 & 0.5 & 0 & 0 \\
0.1 & 0.3 & 0.6 & 0 \\
0.4 & 0.3 & 0.2 & 0.1
\end{bmatrix}
\]

First explain why $\mathbf{P}$ has zero entries. Next determine how many days will pass
before the probability of all 3 machines failing is greater than 0.8. Assume
that initially all 3 machines are working. Hint: You will need a computer to
find the solution.

22.26 (^) (w,c) A pond holds 4 fish. Each day a fisherman goes fishing and his
probability of catching k = 0, 1, 2, 3, 4 fish that day follows a binomial PMF
with p = 1/2. How many days should he plan on fishing so that the probability
of his catching all 4 fish exceeds 0.9? Note that initially, i.e., at n = 0, all 4
fish are present. Hint: You will need a computer to find the solution.

22.27 (t) In this problem we prove that a doubly stochastic transition probability
matrix with $\mathbf{P} > 0$ produces equal stationary probabilities. First recall that
since the columns of $\mathbf{P}$ sum to one, we have that $\mathbf{P}^T\mathbf{1} = \mathbf{1}$ and therefore argue
that $\mathbf{P}^{\infty T}\mathbf{1} = \mathbf{1}$. Next use the results of Problem 22.17 that $\mathbf{P}^\infty = \mathbf{1}\boldsymbol{\pi}^T$ to
show that $\boldsymbol{\pi} = \mathbf{1}/K$.

22.28 (o) (c) Use a computer simulation to generate a realization of the golf
example for a large number of holes (very much greater than 18). Estimate the
percentage of one-putts from your realization and compare it to the theoretical
results.

22.29 (c) Repeat Problem 22.28 but now estimate the average time between
one-putts. Compare your results to the theoretical value.

22.30 (c) Run the program sierpinski.m given in Section 22.9 but use instead the
initial position $\mathbf{X}[0] = [50 \;\; 30]^T$. Do you obtain similar results to those shown
in Figure 22.9? What is the difference, if any?

Appendix 22A

Solving for the Stationary PMF


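We derive the formula of (22.22). The set of equations to be solved (after
transposition) is $\mathbf{P}^T\boldsymbol{\pi} = \boldsymbol{\pi}$ or equivalently

\[
(\mathbf{I} - \mathbf{P}^T)\boldsymbol{\pi} = \mathbf{0}. \tag{22A.1}
\]

Since we have assumed a unique solution, it is clear that the matrix $\mathbf{I} - \mathbf{P}^T$ cannot
be invertible or else we would have $\boldsymbol{\pi} = \mathbf{0}$. This is to say that the linear equations
are not all independent. To make them independent we must add the constraint
equation $\sum_i \pi_i = 1$ or in vector form this is $\mathbf{1}^T\boldsymbol{\pi} = 1$. Equivalently, the constraint
equation is $\mathbf{1}\mathbf{1}^T\boldsymbol{\pi} = \mathbf{1}$. Adding this to (22A.1) produces

\[
(\mathbf{I} - \mathbf{P}^T)\boldsymbol{\pi} + \mathbf{1}\mathbf{1}^T\boldsymbol{\pi} = \mathbf{1}
\]

or

\[
(\mathbf{I} - \mathbf{P}^T + \mathbf{1}\mathbf{1}^T)\boldsymbol{\pi} = \mathbf{1}.
\]

It can be shown that the matrix $\mathbf{I} - \mathbf{P}^T + \mathbf{1}\mathbf{1}^T$ is now invertible and so the solution
is

\[
\boldsymbol{\pi} = (\mathbf{I} - \mathbf{P}^T + \mathbf{1}\mathbf{1}^T)^{-1}\mathbf{1}.
\]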
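
As a minimal numerical sketch of this result, the MATLAB fragment below evaluates the solution for a two-state chain. The transition probability matrix and all variable names are assumptions chosen only for illustration; they do not come from the text.

```matlab
% stationary PMF sketch: pi = (I - P' + 1*1')^(-1) * 1
P = [0.9 0.1; 0.3 0.7];       % assumed example transition probability matrix
K = size(P,1);                % number of states
one = ones(K,1);              % vector of all ones
% solve the linear system with backslash instead of forming the inverse
pivec = (eye(K) - P' + one*one') \ one;
residual = norm(P'*pivec - pivec);   % near zero if P'*pi = pi holds
disp(pivec')                  % stationary probabilities as a row vector
disp(residual)
```

For this assumed matrix the stationary probabilities work out by hand to 0.75 and 0.25, which gives a quick check on the computed vector.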
Appendix A

Glossary of Symbols and Abbreviations

Symbols

Boldface characters denote vectors or matrices. All others are scalars. All vectors
are column vectors. Random variables are denoted by capital letters such as
$U, V, W, X, Y, Z$ and random vectors by $\mathbf{U}, \mathbf{V}, \mathbf{W}, \mathbf{X}, \mathbf{Y}, \mathbf{Z}$ and their values by
corresponding lowercase letters.

$\angle$    angle of
$*$    complex conjugate
$\star$    convolution operator, either convolution sum or integral
$\hat{\ }$    denotes estimator
$\sim$    denotes is distributed according to
$[x]$    denotes the largest integer $\le x$
$x^+$    denotes a number slightly larger than x
$x^-$    denotes a number slightly smaller than x
$A \times B$    cartesian product of sets A and B
$[\mathbf{A}]_{ij}$    (i,j)th element of $\mathbf{A}$
$\mathcal{A}(z)$    z-transform of a[n] sequence
$[\mathbf{b}]_i$    ith element of $\mathbf{b}$
$\mathrm{Ber}(p)$    Bernoulli random variable
$\mathrm{bin}(M,p)$    binomial random variable
$\chi^2_N$    chi-squared distribution with N degrees of freedom
$\binom{N}{k}$    number of combinations of N things taken k at a time
$^c$    complement of set
$\mathrm{cov}(X,Y)$    covariance of X and Y
$\mathbf{C}$    covariance matrix
$\mathbf{C}_X$    covariance matrix of $\mathbf{X}$
$\mathbf{C}_{X,Y}$    covariance matrix of $\mathbf{X}$ and $\mathbf{Y}$
$c_X[n_1,n_2]$    covariance sequence of discrete-time random process X[n]
$c_X(t_1,t_2)$    covariance function of continuous-time random process X(t)
$\delta(t)$    Dirac delta function or impulse function
$\delta[n]$    discrete-time unit impulse sequence
$\delta_{ij}$    Kronecker delta
$\Delta f$    small interval in frequency f
$\Delta t$    small interval in t
$\Delta x$    small interval in x
$\Delta t$    time interval between samples
$\det(\mathbf{A})$    determinant of matrix $\mathbf{A}$
$\mathrm{diag}(a_{11},\ldots,a_{NN})$    diagonal matrix with elements $a_{ii}$ on main diagonal
$\mathbf{e}_i$    natural unit vector in ith direction
$\eta$    signal-to-noise ratio
$E[\cdot]$    expected value
$E[X^n]$    nth moment
$E[(X - E[X])^n]$    nth central moment
$E_X[\cdot]$    expected value with respect to PMF or PDF of X
$E_{X,Y}[\cdot]$    expected value with respect to joint PMF or joint PDF of (X, Y)
$E_{X_1,X_2,\ldots,X_N}[\cdot]$    expected value with respect to N-dimensional joint PMF or PDF
$E_{\mathbf{X}}[\cdot]$    shortened notation for $E_{X_1,X_2,\ldots,X_N}[\cdot]$
$E_{Y|X}[Y|X]$    conditional expected value considered as random variable
$E_{Y|X}[Y|x_i]$    expected value of PMF $p_{Y|X}[y_j|x_i]$
$E_{Y|X}[Y|x]$    expected value of PDF $p_{Y|X}(y|x)$
$E[\mathbf{X}]$    expected value of random vector $\mathbf{X}$
$\in$    element of set
$\exp(\lambda)$    exponential random variable
$f$    discrete-time frequency
$F$    continuous-time frequency
$F_X(x)$    cumulative distribution function of X
$F_X^{-1}(x)$    inverse cumulative distribution function of X
$F_{X,Y}(x,y)$    cumulative distribution function of X and Y
$F_{X_1,\ldots,X_N}(x_1,\ldots,x_N)$    cumulative distribution function of $X_1,\ldots,X_N$
$F_{Y|X}(y|x)$    cumulative distribution function of Y conditioned on X = x
$\mathcal{F}$    Fourier transform
$\mathcal{F}^{-1}$    inverse Fourier transform
$g(\cdot)$    general notation for function of real variable
$g^{-1}(\cdot)$    general notation for inverse function of $g(\cdot)$
$\Gamma(x)$    Gamma function
$\Gamma(\alpha,\lambda)$    Gamma random variable
$\gamma_{X,Y}(f)$    coherence function for discrete-time random processes X[n] and Y[n]
$\mathrm{geom}(p)$    geometric random variable
$h[n]$    impulse response of LSI system
$h(t)$    impulse response of LTI system
$H(f)$    frequency response of LSI system
$H(F)$    frequency response of LTI system
$\mathcal{H}(z)$    system function of LSI system
$I_A(x)$    indicator function for the set A
$\mathbf{I}$    identity matrix
$\cap$    intersection of sets
$j$    $\sqrt{-1}$
$\partial(w,z)/\partial(x,y)$    Jacobian matrix of transformation of w = g(x,y), z = h(x,y)
$\partial(x_1,\ldots,x_N)/\partial(y_1,\ldots,y_N)$    Jacobian matrix of transformation from $\mathbf{y}$ to $\mathbf{x}$
$\mathbf{\Lambda}$    diagonal matrix with eigenvalues on main diagonal
mse    mean square error
$\mu$    mean
$\mu_X[n]$    mean sequence of discrete-time random process X[n]
$\mu_X(t)$    mean function of continuous-time random process X(t)
$\boldsymbol{\mu}$    mean vector
$\binom{M}{k_1,k_2,\ldots,k_N}$    multinomial coefficient
$n$    discrete-time index
$N!$    N factorial
$(N)_r$    equal to $N(N-1)\cdots(N-r+1)$
$N_A$    number of elements in set A
$\mathcal{N}(\mu,\sigma^2)$    normal or Gaussian random variable with mean $\mu$ and variance $\sigma^2$
$\mathcal{N}(\boldsymbol{\mu},\mathbf{C})$    multivariate normal or Gaussian random vector with mean $\boldsymbol{\mu}$ and covariance $\mathbf{C}$
$\|\mathbf{x}\|$    Euclidean norm or length of vector $\mathbf{x}$
$\emptyset$    null or empty set
opt    optimal value
$\mathbf{1}$    vector of all ones
$\mathrm{Pois}(\lambda)$    Poisson random variable
$p_X[x_i]$    PMF of X
$p_X[k]$    PMF of integer-valued random variable X (or $p_X[i]$, $p_X[j]$)
$p_{X,Y}[x_i,y_j]$    joint PMF of X and Y
$p_{X_1,\ldots,X_N}[x_1,\ldots,x_N]$    joint PMF of $X_1,\ldots,X_N$
$p_{\mathbf{X}}[\mathbf{x}]$    shortened notation for $p_{X_1,\ldots,X_N}[x_1,\ldots,x_N]$
$p_{X_1,\ldots,X_N}[k_1,\ldots,k_N]$    joint PMF of integer-valued random variables $X_1,\ldots,X_N$
$p_{Y|X}[y_j|x_i]$    conditional PMF of Y given $X = x_i$
$p_{X_N|X_1,\ldots,X_{N-1}}[x_N|x_1,\ldots,x_{N-1}]$    conditional PMF of $X_N$ given $X_1,\ldots,X_{N-1}$
$p_{X,Y}[i,j]$    joint PMF of integer-valued random variables X and Y
$p_{Y|X}[j|i]$    conditional PMF of integer-valued random variable Y given X = i
$p_X(x)$    PDF of X
$p_{X,Y}(x,y)$    joint PDF of X and Y
$p_{X_1,\ldots,X_N}(x_1,\ldots,x_N)$    joint PDF of $X_1,\ldots,X_N$
$p_{\mathbf{X}}(\mathbf{x})$    shortened notation for $p_{X_1,\ldots,X_N}(x_1,\ldots,x_N)$
$p_{Y|X}(y|x)$    conditional PDF of Y given X = x
$P[E]$    probability of the event E
$P_e$    probability of error
$P_X(f)$    power spectral density of discrete-time random process X[n]
$\mathcal{P}_X(z)$    z-transform of autocorrelation sequence $r_X[k]$
$P_X(F)$    power spectral density of continuous-time random process X(t)
$P_{X,Y}(f)$    cross-power spectral density of discrete-time random processes X[n] and Y[n]
$P_{X,Y}(F)$    cross-power spectral density of continuous-time random processes X(t) and Y(t)
$\phi_X(\omega)$    characteristic function of X
$\phi_{X,Y}(\omega_X,\omega_Y)$    joint characteristic function of X and Y
$\phi_{X_1,\ldots,X_N}(\omega_1,\ldots,\omega_N)$    joint characteristic function of $X_1,\ldots,X_N$
$\Phi(x)$    cumulative distribution function of $\mathcal{N}(0,1)$ random variable
$Q(x)$    probability that a $\mathcal{N}(0,1)$ random variable exceeds x
$Q^{-1}(u)$    value of $\mathcal{N}(0,1)$ random variable that is exceeded with probability of u
$\rho_{X,Y}$    correlation coefficient of X and Y
$r_X[k]$    autocorrelation sequence of discrete-time random process X[n]
$r_X(\tau)$    autocorrelation function of continuous-time random process X(t)
$r_{X,Y}[k]$    cross-correlation sequence of discrete-time random processes X[n] and Y[n]
$r_{X,Y}(\tau)$    cross-correlation function of continuous-time random processes X(t) and Y(t)
$R$ or $R^1$    denotes real line
$R^N$    denotes N-dimensional Euclidean space
$\mathbf{R}_X$    autocorrelation matrix
$\mathcal{S}$    sample space
$\mathcal{S}_X$    sample space of random variable X
$\mathcal{S}_{X,Y}$    sample space of random variables X and Y
$\mathcal{S}_{X_1,X_2,\ldots,X_N}$    sample space of random variables $X_1, X_2, \ldots, X_N$
$s_i$    element of discrete sample space
$s$    element of continuous sample space
$\sigma^2$    variance
$\sigma_X^2$    variance of random variable X
$\sigma_X^2[n]$    variance sequence of discrete-time random process X[n]
$\sigma_X^2(t)$    variance function of continuous-time random process X(t)
$s[n]$    discrete-time signal
$\mathbf{s}$    vector of signal samples
$s(t)$    continuous-time signal
$t$    continuous time
$T$    transpose of matrix
$\mathcal{U}(a,b)$    uniform random variable over the interval (a, b)
$\cup$    union of sets
$u[n]$    discrete unit step function
$u(x)$    unit step function
$\mathcal{U}(z)$    z-transform of u[n] sequence
$\mathbf{V}$    modal matrix
$\mathrm{var}(X)$    variance of X
$\mathrm{var}(Y|x_i)$    variance of conditional PMF or of $p_{Y|X}[y_j|x_i]$
$x_i$    value of discrete random variable
$x$    value of continuous random variable
$X_s$    standardized version of random variable X
$x_s$    value for $X_s$
$X[n]$    discrete-time random process
$x[n]$    realization of discrete-time random process
$X(t)$    continuous-time random process
$x(t)$    realization of continuous-time random process
$\mathcal{X}(z)$    z-transform of x[n] sequence
$\bar{X}$    sample mean random variable
$\bar{x}$    value of $\bar{X}$
$\mathbf{X}$    random vector $(X_1, X_2, \ldots, X_N)$
$\mathbf{x}$    value $(x_1, x_2, \ldots, x_N)$ of random vector $\mathbf{X}$
$Y|(X = x_i)$    random variable Y conditioned on $X = x_i$
$\mathcal{Z}$    z-transform
$\mathcal{Z}^{-1}$    inverse z-transform
$\mathbf{0}$    vector or matrix of all zeros
Abbreviations

ACF      autocorrelation function
ACS      autocorrelation sequence
AR       autoregressive
AR(p)    autoregressive process of order p
ARMA     autoregressive moving average
CCF      cross-correlation function
CCS      cross-correlation sequence
CDF      cumulative distribution function
CPSD     cross-power spectral density
CTCV     continuous-time/continuous-valued
CTDV     continuous-time/discrete-valued
D/A      digital-to-analog
dB       decibel
DC       constant level (direct current)
DFT      discrete Fourier transform
DTCV     discrete-time/continuous-valued
DTDV     discrete-time/discrete-valued
FFT      fast Fourier transform
FIR      finite impulse response
GHz      giga-hertz
Hz       hertz
IID      independent and identically distributed
IIR      infinite impulse response
KHz      kilo-hertz
LSI      linear shift invariant
LTI      linear time invariant
MA       moving average
MHz      mega-hertz
MSE      mean square error
PDF      probability density function
PMF      probability mass function
PSD      power spectral density
SNR      signal-to-noise ratio
WGN      white Gaussian noise
WSS      wide sense stationary


Appendix  B 

Assorted  Math  Facts  and 
Formulas 


An  extensive  summary  of  math  facts  and  formulas  can  be  found  in  [Gradshteyn  and 
Ryzhik  1994]. 


B.1 Proof by Induction

To prove that a statement is true, for example,

\[
\sum_{i=1}^{N} i = \frac{N}{2}(N+1) \tag{B.1}
\]

by mathematical induction we proceed as follows:

1. Prove the statement is true for N = 1.

2. Assume the statement is true for N = n and prove that it therefore must be true
for N = n + 1.

Obviously, (B.1) is true for N = 1 since $\sum_{i=1}^{1} i = 1$ and (N/2)(N + 1) = (1/2)(2) = 1.
Now assume it is true for N = n. Then for N = n + 1 we have

\begin{align*}
\sum_{i=1}^{n+1} i &= \sum_{i=1}^{n} i + (n+1) \\
&= \frac{n}{2}(n+1) + (n+1) \qquad \text{(since it is true for } N = n\text{)} \\
&= \frac{n+1}{2}(n+2) \\
&= \frac{(n+1)}{2}[(n+1)+1]
\end{align*}

which proves that it is also true for N = n + 1. By induction, since it is true for
N = n = 1 from step 1, it must also be true for N = (n + 1) = 2 from step 2. And
since it is true for N = n = 2, it must also be true for N = n + 1 = 3, etc.
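
A numerical spot check is no substitute for the induction argument, but it can catch an algebra slip in the closed form. The following MATLAB sketch compares direct summation with the right-hand side of (B.1); the range N = 1, ..., 20 and the variable names are arbitrary assumptions.

```matlab
% spot check of (B.1): sum_{i=1}^{N} i = (N/2)(N+1)
ok = true;
for N = 1:20                  % assumed range of N to test
    lhs = sum(1:N);           % direct summation 1 + 2 + ... + N
    rhs = (N/2)*(N+1);        % closed form from (B.1)
    ok = ok && (lhs == rhs);  % both sides are exact for these small integers
end
disp(ok)
```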

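B.2 Trigonometry

Some useful trigonometric identities are:

Fundamental

\[
\sin^2\alpha + \cos^2\alpha = 1 \tag{B.2}
\]

Sum of angles

\[
\sin(\alpha + \beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta \tag{B.3}
\]
\[
\cos(\alpha + \beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta \tag{B.4}
\]

Double angle

\[
\sin(2\alpha) = 2\sin\alpha\cos\alpha \tag{B.5}
\]
\[
\cos(2\alpha) = \cos^2\alpha - \sin^2\alpha = 2\cos^2\alpha - 1 \tag{B.6}
\]

Squared sine and cosine

\[
\sin^2\alpha = \frac{1}{2} - \frac{1}{2}\cos(2\alpha) \tag{B.7}
\]
\[
\cos^2\alpha = \frac{1}{2} + \frac{1}{2}\cos(2\alpha) \tag{B.8}
\]

Euler identities For $j = \sqrt{-1}$

\[
\exp(j\alpha) = \cos\alpha + j\sin\alpha \tag{B.9}
\]
\[
\cos\alpha = \frac{\exp(j\alpha) + \exp(-j\alpha)}{2} \tag{B.10}
\]
\[
\sin\alpha = \frac{\exp(j\alpha) - \exp(-j\alpha)}{2j} \tag{B.11}
\]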

B.3 Limits

Alternative definition of exponential function

\[
\lim_{M \to \infty} \left(1 + \frac{x}{M}\right)^M = \exp(x) \tag{B.12}
\]

Taylor series expansion about the point $x = x_0$

\[
g(x) = \sum_{i=0}^{\infty} \frac{g^{(i)}(x_0)}{i!}(x - x_0)^i \tag{B.13}
\]

where $g^{(i)}(x_0)$ is the ith derivative of g(x) evaluated at $x = x_0$ and $g^{(0)}(x_0) = g(x_0)$.
As an example, consider g(x) = exp(x), which when expanded about
$x = x_0 = 0$ yields

\[
\exp(x) = \sum_{i=0}^{\infty} \frac{x^i}{i!}.
\]

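B.4 Sums

Integers

\[
\sum_{i=0}^{N-1} i = \frac{N(N-1)}{2} \tag{B.14}
\]
\[
\sum_{i=0}^{N-1} i^2 = \frac{N(N-1)(2N-1)}{6} \tag{B.15}
\]

Real geometric series

\[
\sum_{i=k}^{N-1} x^i = \frac{x^k(1 - x^{N-k})}{1 - x} \qquad (x \text{ is real}) \tag{B.16}
\]

If $|x| < 1$, then

\[
\sum_{i=k}^{\infty} x^i = \frac{x^k}{1 - x} \tag{B.17}
\]

Complex geometric series

\[
\sum_{i=k}^{N-1} z^i = \frac{z^k(1 - z^{N-k})}{1 - z} \qquad (z \text{ is complex}) \tag{B.18}
\]

A special case is when $z = \exp(j\theta)$. Then

\[
\sum_{i=0}^{N-1} \exp(j\theta i) = \frac{1 - \exp(jN\theta)}{1 - \exp(j\theta)}
= \exp\left[j\frac{N-1}{2}\theta\right] \frac{\sin\left(\frac{N\theta}{2}\right)}{\sin\left(\frac{\theta}{2}\right)} \tag{B.19}
\]

If $|z| = |x + jy| = \sqrt{x^2 + y^2} < 1$, then as $N \to \infty$ (B.18) becomes

\[
\sum_{i=k}^{\infty} z^i = \frac{z^k}{1 - z} \tag{B.20}
\]

Double sums

\[
\sum_{i=1}^{M} \sum_{j=1}^{M} x_i y_j = \left(\sum_{i=1}^{M} x_i\right) \left(\sum_{j=1}^{M} y_j\right) \tag{B.21}
\]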
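
The closed forms for the complex geometric series are easy to mistype, so a direct numerical comparison can be useful. The MATLAB sketch below checks (B.18) and (B.19) against brute-force summation; the values of N, k, z, and theta are arbitrary assumptions.

```matlab
% numerical check of the complex geometric series (B.18) and (B.19)
N = 8; k = 2;                          % assumed summation limits
z = 0.5 + 0.3j;                        % assumed complex number
idx = k:N-1;
direct = sum(z.^idx);                  % brute-force sum of z^i, i = k,...,N-1
closed = z^k*(1 - z^(N-k))/(1 - z);    % closed form (B.18)

theta = 0.7;                           % assumed angle in radians
n = 0:N-1;
lhs = sum(exp(1j*theta*n));            % brute-force sum of exp(j*theta*i)
rhs = exp(1j*(N-1)*theta/2)*sin(N*theta/2)/sin(theta/2);   % form in (B.19)
disp([abs(direct - closed) abs(lhs - rhs)])   % both should be near zero
```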


B.5 Calculus

Convergence of sum to integral

If g(x) is a continuous function over [a, b], then

\[
\lim_{\Delta x \to 0} \sum_{i=0}^{M} g(x_i)\Delta x = \int_a^b g(x)\,dx \tag{B.22}
\]

where $x_i = a + i\Delta x$ and $x_M = b$. Also, this shows how to approximate an
integral by a sum.
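
As a sketch of how such an approximation might be carried out, the following MATLAB fragment evaluates the sum for an integrand whose integral is known. The integrand, the interval, and the number of subintervals are assumptions chosen for illustration.

```matlab
% approximate an integral by a sum as in (B.22)
g = @(x) x.^2;                % assumed integrand
a = 0; b = 1;                 % assumed interval, exact integral is 1/3
M = 1000;                     % assumed number of subintervals
dx = (b - a)/M;               % interval width, so that x_M = b
xi = a + (0:M)*dx;            % sample points x_i = a + i*dx
approx = sum(g(xi))*dx;       % sum of g(x_i)*dx
err = abs(approx - 1/3);      % shrinks as dx is made smaller
disp([approx err])
```

With these assumed values the error is on the order of dx; repeating the run with a larger M should shrink it accordingly.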

Approximation of integral over small interval

\[
\int_{x_0 - \Delta x/2}^{x_0 + \Delta x/2} g(x)\,dx \approx g(x_0)\Delta x \tag{B.23}
\]

Differentiation of composite function

\[
\left.\frac{dg(h(x))}{dx}\right|_{x=x_0} = \left.\frac{dg(u)}{du}\right|_{u=h(x_0)} \left.\frac{dh(x)}{dx}\right|_{x=x_0} \qquad \text{(chain rule)} \tag{B.24}
\]

Change of integration variable

If u = h(x), then

\[
\int_a^b g(u)\,du = \int_{h^{-1}(a)}^{h^{-1}(b)} g(h(x))h'(x)\,dx \tag{B.25}
\]

where $h'(x)$ is the derivative of h(x) and $h^{-1}(\cdot)$ denotes the inverse function.
This assumes that there is one solution to the equation u = h(x) over the
interval a < u < b.

Fundamental theorem of calculus

\[
\frac{d}{dx}\int_{-\infty}^{x} g(t)\,dt = g(x) \tag{B.26}
\]

Leibnitz’s rule

\[
\frac{d}{dy}\int_{h_1(y)}^{h_2(y)} g(x,y)\,dx = \int_{h_1(y)}^{h_2(y)} \frac{\partial}{\partial y} g(x,y)\,dx + g(h_2(y),y)\frac{dh_2(y)}{dy} - g(h_1(y),y)\frac{dh_1(y)}{dy} \tag{B.27}
\]

Integration of even and odd functions An even function is defined as having
the property g(−x) = g(x), while an odd function has the property g(−x) =
−g(x). As a result,

\[
\int_{-M}^{M} g(x)\,dx = \begin{cases} 2\int_0^M g(x)\,dx & \text{for } g(x) \text{ an even function} \\ 0 & \text{for } g(x) \text{ an odd function} \end{cases}
\]

Integration by parts

If U and V are both functions of x, then

\[
\int U\,dV = UV - \int V\,dU \tag{B.28}
\]

Dirac delta “function” or impulse

Denoted by $\delta(x)$ it is not really a function but a symbol that has the definition

\[
\delta(x) = \begin{cases} 0 & x \ne 0 \\ \infty & x = 0 \end{cases}
\]

and

\[
\int_a^b \delta(x)\,dx = \begin{cases} 1 & 0 \in [a^-, b^+] \\ 0 & \text{otherwise} \end{cases}
\]

Some properties are for u(x) the unit step function

\[
\frac{du(x)}{dx} = \delta(x)
\]
\[
\int_{-\infty}^{x} \delta(t)\,dt = u(x)
\]

Double integrals

\[
\int_c^d \int_a^b g(x)h(y)\,dx\,dy = \left(\int_a^b g(x)\,dx\right) \left(\int_c^d h(y)\,dy\right) \tag{B.29}
\]

References

Gradshteyn, I.S., I.M. Ryzhik, Table of Integrals, Series, and Products, Fifth Ed.,
Academic Press, New York, 1994.


Appendix  C 

Linear  and  Matrix  Algebra 


Important results from linear and matrix algebra theory are reviewed in this
appendix. It is assumed that the reader has had some exposure to matrices. For
a more comprehensive treatment the books [Noble and Daniel 1977] and [Graybill
1969] are recommended.

C.l  Definitions 

Consider  an  M  x  N  matrix  A  with  elements  a^,  i  —  1,2,..,,  M;  j  =  1, 2, . . . ,  N.  A 
shorthand  notation  for  describing  A  is 


Likewise  a  shorthand  notation  for  describing  an  N  x  1  vector  b  is 

[b]i  =  k. 

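An $M \times N$ matrix $\mathbf{A}$ may multiply an $N \times 1$ vector $\mathbf{b}$ to yield a
new $M \times 1$ vector $\mathbf{c}$ whose $i$th element is

$$c_i = \sum_{j=1}^{N} a_{ij} b_j \qquad i = 1, 2, \ldots, M.$$

Similarly, an $M \times N$ matrix $\mathbf{A}$ can multiply an $N \times L$ matrix $\mathbf{B}$ to
yield an $M \times L$ matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$ whose $(i,j)$ element is

$$c_{ij} = \sum_{k=1}^{N} a_{ik} b_{kj} \qquad i = 1, 2, \ldots, M;\; j = 1, 2, \ldots, L.$$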

Vectors  and  matrices  that  can  be  multiplied  together  are  said  to  be  conformable. 
The transpose of $\mathbf{A}$, which is denoted by $\mathbf{A}^T$, is defined as the
$N \times M$ matrix with elements $a_{ji}$, or

$$[\mathbf{A}^T]_{ij} = a_{ji}.$$

A square matrix is one for which $M = N$. A square matrix is symmetric if
$\mathbf{A}^T = \mathbf{A}$ or $a_{ji} = a_{ij}$.

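The inverse of a square $N \times N$ matrix is the square $N \times N$ matrix
$\mathbf{A}^{-1}$ for which

$$\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$$

where $\mathbf{I}$ is the $N \times N$ identity matrix. If the inverse does not exist, then
$\mathbf{A}$ is singular. Assuming the existence of the inverse of a matrix, the unique
solution to a set of $N$ simultaneous linear equations given in matrix form by
$\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ is $N \times N$, $\mathbf{x}$ is $N \times 1$, and
$\mathbf{b}$ is $N \times 1$, is $\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$.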

The determinant of a square $N \times N$ matrix is denoted by $\det(\mathbf{A})$. It is
computed as

$$\det(\mathbf{A}) = \sum_{j=1}^{N} a_{ij} C_{ij}$$

where

$$C_{ij} = (-1)^{i+j} D_{ij}.$$

$D_{ij}$ is the determinant of the submatrix of $\mathbf{A}$ obtained by deleting the $i$th
row and $j$th column and is termed the minor of $a_{ij}$. $C_{ij}$ is the cofactor of
$a_{ij}$. Note that any choice of $i$ for $i = 1, 2, \ldots, N$ will yield the same value
for $\det(\mathbf{A})$. A square $N \times N$ matrix is nonsingular if and only if
$\det(\mathbf{A}) \neq 0$.
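The cofactor expansion can be checked numerically. The following is a minimal sketch, not taken from the text: the helper names `minor` and `det` and the example matrix are illustrative assumptions, and indices are 0-based rather than the 1-based indices used above.

```python
def minor(a, i, j):
    """Submatrix of a with row i and column j deleted (0-based indices)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]


def det(a, i=0):
    """Determinant by cofactor expansion along row i (0-based)."""
    n = len(a)
    if n == 1:
        return a[0][0]
    total = 0
    for j in range(n):
        d_ij = det(minor(a, i, j))        # minor D_ij
        c_ij = (-1) ** (i + j) * d_ij     # cofactor C_ij
        total += a[i][j] * c_ij
    return total


# Hypothetical example matrix (an assumption, chosen only for illustration).
A = [[2, 0, 1],
     [1, 3, 2],
     [1, 1, 4]]

# Expanding along each of the three rows should give the same determinant.
values = [det(A, i) for i in range(3)]
print(values)
```

The recursion costs on the order of $N!$ operations, so it is useful only for small matrices or for analytical work; it is shown here because it mirrors the definition term by term.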

A quadratic form $Q$, which is a scalar, is defined as

$$Q = \sum_{i=1}^{N}\sum_{j=1}^{N} a_{ij} x_i x_j.$$

In defining the quadratic form it is assumed that $a_{ji} = a_{ij}$. This entails no loss
in generality since any quadratic function may be expressed in this manner. $Q$ may
also be expressed as

$$Q = \mathbf{x}^T\mathbf{A}\mathbf{x}$$

where $\mathbf{x} = [x_1\; x_2\; \ldots\; x_N]^T$ and $\mathbf{A}$ is a square $N \times N$ matrix with
$a_{ji} = a_{ij}$, or $\mathbf{A}$ is a symmetric matrix.

A square $N \times N$ matrix $\mathbf{A}$ is positive semidefinite if $\mathbf{A}$ is symmetric and

$$Q = \mathbf{x}^T\mathbf{A}\mathbf{x} \geq 0$$

for all $\mathbf{x}$. If the quadratic form is strictly positive for $\mathbf{x} \neq \mathbf{0}$, then
$\mathbf{A}$ is positive definite. When referring to a matrix as positive definite or
positive semidefinite, it is always assumed that the matrix is symmetric.


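A partitioned $M \times N$ matrix $\mathbf{A}$ is one that is expressed in terms of its
submatrices. An example is the $2 \times 2$ partitioning

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}.$$

Each “element” $\mathbf{A}_{ij}$ is a submatrix of $\mathbf{A}$. The dimensions of the partitions
are given as

$$\begin{bmatrix} K \times L & K \times (N-L) \\ (M-K) \times L & (M-K) \times (N-L) \end{bmatrix}.$$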

C.2  Special  Matrices 


A diagonal matrix is a square $N \times N$ matrix with $a_{ij} = 0$ for $i \neq j$, or all
elements not on the principal diagonal (the diagonal containing the elements
$a_{ii}$) are zero. The elements $a_{ij}$ for which $i \neq j$ are termed the off-diagonal
elements. A diagonal matrix appears as

$$\mathbf{A} = \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{NN} \end{bmatrix}.$$

A diagonal matrix will sometimes be denoted by $\mathrm{diag}(a_{11}, a_{22}, \ldots, a_{NN})$.
The inverse of a diagonal matrix is found by simply inverting each element on the
principal diagonal, assuming that $a_{ii} \neq 0$ for $i = 1, 2, \ldots, N$ (which is
necessary for invertibility).

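A square $N \times N$ matrix is orthogonal if

$$\mathbf{A}^{-1} = \mathbf{A}^T.$$

For a matrix to be orthogonal the columns (and rows) must be orthonormal, or if

$$\mathbf{A} = [\,\mathbf{a}_1\; \mathbf{a}_2\; \ldots\; \mathbf{a}_N\,]$$

where $\mathbf{a}_i$ denotes the $i$th column, the conditions

$$\mathbf{a}_i^T\mathbf{a}_j = \begin{cases} 0 & \text{for } i \neq j \\ 1 & \text{for } i = j \end{cases}$$

must be satisfied. Other “matrices” that can be constructed from vector operations
on the $N \times 1$ vectors $\mathbf{x}$ and $\mathbf{y}$ are the inner product, which is defined as the
scalar

$$\mathbf{x}^T\mathbf{y} = \sum_{i=1}^{N} x_i y_i$$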



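and the outer product, which is defined as the $N \times N$ matrix

$$\mathbf{x}\mathbf{y}^T = \begin{bmatrix} x_1y_1 & x_1y_2 & \cdots & x_1y_N \\ x_2y_1 & x_2y_2 & \cdots & x_2y_N \\ \vdots & \vdots & \ddots & \vdots \\ x_Ny_1 & x_Ny_2 & \cdots & x_Ny_N \end{bmatrix}.$$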


C.3  Matrix  Manipulation  and  Formulas 

Some  useful  formulas  for  the  algebraic  manipulation  of  matrices  are  summarized  in 
this  section.  For  N  x  N  matrices  A  and  B  the  following  relationships  are  useful. 


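$$\begin{aligned}
(\mathbf{A}^T)^{-1} &= (\mathbf{A}^{-1})^T \\
(\mathbf{A}\mathbf{B})^{-1} &= \mathbf{B}^{-1}\mathbf{A}^{-1} \\
\det(\mathbf{A}^T) &= \det(\mathbf{A}) \\
\det(c\mathbf{A}) &= c^N \det(\mathbf{A}) \quad (c \text{ a scalar}) \\
\det(\mathbf{A}\mathbf{B}) &= \det(\mathbf{A})\det(\mathbf{B}) \\
\det(\mathbf{A}^{-1}) &= \frac{1}{\det(\mathbf{A})}.
\end{aligned}$$

Also, for any conformable matrices (or vectors) we have

$$(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T.$$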

It is frequently necessary to determine the inverse of a matrix analytically. To do so
one can make use of the following formula. The inverse of a square $N \times N$ matrix is

$$\mathbf{A}^{-1} = \frac{\mathbf{C}^T}{\det(\mathbf{A})}$$

where $\mathbf{C}$ is the square $N \times N$ matrix of cofactors of $\mathbf{A}$. The cofactor matrix
is defined by

$$[\mathbf{C}]_{ij} = (-1)^{i+j} D_{ij}$$

where $D_{ij}$ is the minor of $a_{ij}$ obtained by deleting the $i$th row and $j$th column of
$\mathbf{A}$.
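As a check on the cofactor formula for the inverse, the sketch below builds $\mathbf{C}^T/\det(\mathbf{A})$ with exact rational arithmetic and verifies that the product with $\mathbf{A}$ is the identity. The function names and the example matrix are assumptions made for illustration only.

```python
from fractions import Fraction


def minor(a, i, j):
    """Submatrix of a with row i and column j deleted (0-based indices)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]


def det(a):
    """Determinant by cofactor expansion along the first row."""
    if not a:                 # determinant of the empty matrix is 1
        return 1
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))


def inverse(a):
    """Inverse as C^T / det(A), with C the matrix of cofactors."""
    n = len(a)
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular")
    c = [[(-1) ** (i + j) * det(minor(a, i, j)) for j in range(n)]
         for i in range(n)]
    # Element (i, j) of the inverse is C[j][i] / det(A) (note the transpose).
    return [[Fraction(c[j][i], d) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Product of two conformable matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Hypothetical integer example matrix (an assumption).
A = [[2, 0, 1],
     [1, 3, 2],
     [1, 1, 4]]
A_inv = inverse(A)
identity = matmul(A, A_inv)
print(identity)
```

In numerical work one would normally solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ by elimination rather than form the inverse this way; the cofactor formula is mainly of analytical value.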

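Partitioned matrices may be manipulated according to the usual rules of matrix
algebra by considering each submatrix as an element. For multiplication of
partitioned matrices the submatrices that are multiplied together must be conformable.
As an illustration, for $2 \times 2$ partitioned matrices

$$\mathbf{A}\mathbf{B} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11}\mathbf{B}_{11} + \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{A}_{11}\mathbf{B}_{12} + \mathbf{A}_{12}\mathbf{B}_{22} \\ \mathbf{A}_{21}\mathbf{B}_{11} + \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{A}_{21}\mathbf{B}_{12} + \mathbf{A}_{22}\mathbf{B}_{22} \end{bmatrix}.$$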


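Other useful relationships for partitioned matrices for an $M \times N$ matrix $\mathbf{A}$ and
$N \times 1$ vectors $\mathbf{x}_i$ are

$$[\,\mathbf{A}\mathbf{x}_1\; \mathbf{A}\mathbf{x}_2\; \ldots\; \mathbf{A}\mathbf{x}_N\,] = \mathbf{A}\,[\,\mathbf{x}_1\; \mathbf{x}_2\; \ldots\; \mathbf{x}_N\,] \tag{C.1}$$

which is an $M \times N$ matrix and

$$[\,a_{11}\mathbf{x}_1\; a_{22}\mathbf{x}_2\; \ldots\; a_{NN}\mathbf{x}_N\,] = [\,\mathbf{x}_1\; \mathbf{x}_2\; \ldots\; \mathbf{x}_N\,] \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{NN} \end{bmatrix} \tag{C.2}$$

which is an $N \times N$ matrix.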


C.4  Some  Properties  of  Positive  Definite 
(Semidefinite)  Matrices 

Some  useful  properties  of  positive  definite  (semidefinite)  matrices  are: 

1. A square $N \times N$ matrix $\mathbf{A}$ is positive definite if and only if the principal
   minors are all positive. (The $i$th principal minor is the determinant of the
   submatrix formed by deleting all rows and columns with an index greater than $i$.)
   If $\mathbf{A}$ is positive semidefinite, then these principal minors are nonnegative.

2.  If  A  is  positive  definite  (positive  semidefinite),  then 

a. $\mathbf{A}$ is invertible (possibly singular).

b.  the  diagonal  elements  are  positive  (nonnegative). 

c.  the  determinant  of  A,  which  is  a  principal  minor,  is  positive  (nonnegative). 

C.5  Eigendecomposition  of  Matrices 

An eigenvector of a square $N \times N$ matrix $\mathbf{A}$ is an $N \times 1$ vector $\mathbf{v}$
satisfying

$$\mathbf{A}\mathbf{v} = \lambda\mathbf{v} \tag{C.3}$$

for some scalar $\lambda$, which may be complex. $\lambda$ is the eigenvalue of $\mathbf{A}$
corresponding to the eigenvector $\mathbf{v}$. To determine the eigenvalues we must solve
for the $N$ $\lambda$'s in $\det(\mathbf{A} - \lambda\mathbf{I}) = 0$, which is an $N$th order polynomial
in $\lambda$. Once the eigenvalues are found, the corresponding eigenvectors are determined
from the equation $(\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0}$. It is assumed that the eigenvector
is normalized to have unit length or $\mathbf{v}^T\mathbf{v} = 1$.

If $\mathbf{A}$ is symmetric, then one can always find $N$ linearly independent eigenvectors,
although they will not in general be unique. An example is the identity matrix, for
which any vector is an eigenvector with eigenvalue 1. If $\mathbf{A}$ is symmetric, then the
eigenvectors corresponding to distinct eigenvalues are orthonormal, or
$\mathbf{v}_i^T\mathbf{v}_j = 0$ for $i \neq j$ and $\mathbf{v}_i^T\mathbf{v}_j = 1$ for $i = j$, and the eigenvalues are
real. If, furthermore, the matrix is positive definite (positive semidefinite), then the
eigenvalues are positive (nonnegative).

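The defining relation of (C.3) can also be written as (using (C.1) and (C.2))

$$[\,\mathbf{A}\mathbf{v}_1\; \mathbf{A}\mathbf{v}_2\; \ldots\; \mathbf{A}\mathbf{v}_N\,] = [\,\lambda_1\mathbf{v}_1\; \lambda_2\mathbf{v}_2\; \ldots\; \lambda_N\mathbf{v}_N\,]$$

or

$$\mathbf{A}\mathbf{V} = \mathbf{V}\mathbf{\Lambda} \tag{C.4}$$

where

$$\mathbf{V} = [\,\mathbf{v}_1\; \mathbf{v}_2\; \ldots\; \mathbf{v}_N\,]$$

$$\mathbf{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_N).$$

If $\mathbf{A}$ is symmetric so that the eigenvectors corresponding to distinct eigenvalues
are orthonormal and the remaining eigenvectors are chosen to yield an orthonormal
eigenvector set, then $\mathbf{V}$ is an orthogonal matrix. As such, its inverse is $\mathbf{V}^T$, so
that (C.4) becomes

$$\mathbf{A} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T.$$

Also, the inverse is easily determined as

$$\mathbf{A}^{-1} = (\mathbf{V}^T)^{-1}\mathbf{\Lambda}^{-1}\mathbf{V}^{-1} = \mathbf{V}\mathbf{\Lambda}^{-1}\mathbf{V}^T.$$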
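A small numerical check of the eigendecomposition may help fix ideas. The sketch below handles only the symmetric $2 \times 2$ case $\begin{bmatrix} a & b \\ b & d \end{bmatrix}$ with $b \neq 0$, where the eigenvalues follow from the quadratic $\det(\mathbf{A} - \lambda\mathbf{I}) = 0$; the function names and the example matrix are illustrative assumptions.

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigenvalues and unit eigenvectors of [[a, b], [b, d]], assuming b != 0."""
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    lams = [mean + radius, mean - radius]     # roots of det(A - lam I) = 0
    vecs = []
    for lam in lams:
        v = (b, lam - a)                      # solves (A - lam I) v = 0
        norm = math.hypot(v[0], v[1])
        vecs.append((v[0] / norm, v[1] / norm))
    return lams, vecs


def rebuild(lams, vecs, power):
    """Form V diag(lam^power) V^T, where the columns of V are the eigenvectors."""
    return [[sum(vecs[k][i] * lams[k] ** power * vecs[k][j] for k in range(2))
             for j in range(2)] for i in range(2)]


# Hypothetical example: A = [[2, 1], [1, 2]].
lams, vecs = eig_sym_2x2(2.0, 1.0, 2.0)
A_rebuilt = rebuild(lams, vecs, 1)      # V Lambda V^T, should equal A
A_inverse = rebuild(lams, vecs, -1)     # V Lambda^{-1} V^T, should equal A^{-1}
print(lams, A_rebuilt, A_inverse)
```

For this example the eigenvalues are 3 and 1, both positive, which is consistent with the matrix being positive definite.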
Appendix D

Summary  of  Signals,  Linear 
Transforms,  and  Linear  Systems 


In this appendix we summarize the important concepts and formulas for discrete-time
signal and system analysis. This material is used in Chapters 18-20. Some
examples are given, and the reader unfamiliar with this material should try to
verify the example results. For a more comprehensive treatment the books [Jackson
1991], [Oppenheim, Willsky, and Nawab 1997], [Poularikis and Seeley 1985] are
recommended.

D.1 Discrete-Time Signals

A discrete-time signal is a sequence $x[n]$ for $n = \ldots, -1, 0, 1, \ldots$. It is defined
only for the integers. Some important signals are:

a. Unit impulse - $x[n] = 1$ for $n = 0$ and $x[n] = 0$ for $n \neq 0$. It is also denoted by
   $\delta[n]$.

b. Unit step - $x[n] = 1$ for $n \geq 0$ and $x[n] = 0$ for $n < 0$. It is also denoted by $u[n]$.

c. Real sinusoid - $x[n] = A\cos(2\pi f_0 n + \theta)$ for $-\infty < n < \infty$, where $A$ is the
   amplitude (must be nonnegative), $f_0$ is the frequency in cycles per sample and
   must be in the interval $0 \leq f_0 \leq 1/2$, and $\theta$ is the phase in radians.

d. Complex sinusoid - $x[n] = A\exp(j(2\pi f_0 n + \theta))$ for $-\infty < n < \infty$, where $A$ is
   the amplitude (must be nonnegative), $f_0$ is the frequency in cycles per sample and
   must be in the interval $-1/2 \leq f_0 \leq 1/2$, and $\theta$ is the phase in radians.

e. Exponential - $x[n] = a^n u[n]$
Note that any sequence can be written as a linear combination of unit impulses that
are weighted by $x[k]$ and shifted in time as $\delta[n - k]$ to form

$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n - k]. \tag{D.1}$$

For example, $a^n u[n] = \delta[n] + a\,\delta[n-1] + a^2\,\delta[n-2] + \cdots$.

Some  special  signals  are  defined  next. 

a. A signal is causal if $x[n] = 0$ for $n < 0$, for example, $x[n] = u[n]$.

b. A signal is anticausal if $x[n] = 0$ for $n > 0$, for example, $x[n] = u[-n]$.

c. A signal is even if $x[-n] = x[n]$ or it is symmetric about $n = 0$, for example,
   $x[n] = \cos(2\pi f_0 n)$.

d. A signal is odd if $x[-n] = -x[n]$ or it is antisymmetric about $n = 0$, for example,
   $x[n] = \sin(2\pi f_0 n)$.

e. A signal is stable if $\sum_{n=-\infty}^{\infty} |x[n]| < \infty$ (also called absolutely
   summable), for example, $x[n] = (1/2)^n u[n]$.

D.2  Linear  Transforms 

D.2.1  Discrete-Time  Fourier  Transforms 

The discrete-time Fourier transform $X(f)$ of a discrete-time signal $x[n]$ is defined
as

$$X(f) = \sum_{n=-\infty}^{\infty} x[n]\exp(-j2\pi f n) \qquad -1/2 \leq f \leq 1/2. \tag{D.2}$$

An example is $x[n] = (1/2)^n u[n]$ for which $X(f) = 1/(1 - (1/2)\exp(-j2\pi f))$.
It converts a discrete-time signal into a complex function of $f$, where $f$ is called
the frequency and is measured in cycles per sample. The operation of taking the
Fourier transform of a signal is denoted by $\mathcal{F}\{x[n]\}$ and the signal and its Fourier
transform are referred to as a Fourier transform pair. The latter relationship is
usually denoted by $x[n] \Leftrightarrow X(f)$. The discrete-time Fourier transform is periodic in
frequency with period one and for this reason we need only consider the frequency
interval $[-1/2, 1/2]$. Since the Fourier transform is a complex function of frequency,
it can be represented by the two real functions

$$|X(f)| = \sqrt{\mathrm{Re}(X(f))^2 + \mathrm{Im}(X(f))^2}$$

$$\phi(f) = \arctan\frac{\mathrm{Im}(X(f))}{\mathrm{Re}(X(f))}$$

which are called the magnitude and phase, respectively. For example, if
$x[n] = (1/2)^n u[n]$, then

$$|X(f)| = \frac{1}{\sqrt{5/4 - \cos(2\pi f)}}$$

$$\phi(f) = -\arctan\frac{\frac{1}{2}\sin(2\pi f)}{1 - \frac{1}{2}\cos(2\pi f)}.$$

Note that the magnitude is an even function or $|X(-f)| = |X(f)|$ and the phase
is an odd function or $\phi(-f) = -\phi(f)$. Some Fourier transform pairs are given in
Table D.1.

Signal name                 x[n]                                        X(f)  (-1/2 <= f <= 1/2)

Unit impulse                $\delta[n]$ (1 for $n = 0$, 0 for $n \neq 0$)    $1$
Real sinusoid               $\cos(2\pi f_0 n)$                          $\frac{1}{2}\delta(f + f_0) + \frac{1}{2}\delta(f - f_0)$
Complex sinusoid            $\exp(j2\pi f_0 n)$                         $\delta(f - f_0)$
Exponential                 $a^n u[n]$                                  $\frac{1}{1 - a\exp(-j2\pi f)}$, $|a| < 1$
Double-sided exponential    $a^{|n|}$                                   $\frac{1 - a^2}{1 + a^2 - 2a\cos(2\pi f)}$, $|a| < 1$

Table D.1: Discrete-time Fourier transform pairs.

Some important properties of the discrete-time Fourier transform are:


a. Linearity - $\mathcal{F}\{ax[n] + by[n]\} = aX(f) + bY(f)$

b. Time shift - $\mathcal{F}\{x[n - n_0]\} = \exp(-j2\pi f n_0)X(f)$

c. Modulation - $\mathcal{F}\{\cos(2\pi f_0 n)x[n]\} = \frac{1}{2}X(f + f_0) + \frac{1}{2}X(f - f_0)$

d. Time reversal - $\mathcal{F}\{x[-n]\} = X^*(f)$

e. Symmetry - if $x[n]$ is even, then $X(f)$ is even and real, and if $x[n]$ is odd, then
   $X(f)$ is odd and purely imaginary.

f. Energy - the energy defined as $\sum_{n=-\infty}^{\infty} x^2[n]$ can be found from the Fourier
   transform using Parseval's theorem

$$\sum_{n=-\infty}^{\infty} x^2[n] = \int_{-1/2}^{1/2} |X(f)|^2\,df.$$

g. Inner product - as an extension of Parseval’s theorem we have

$$\sum_{n=-\infty}^{\infty} x[n]y[n] = \int_{-1/2}^{1/2} X^*(f)Y(f)\,df.$$

Two signals $x[n]$ and $y[n]$ are said to be convolved together to yield a new signal
$z[n]$ if

$$z[n] = \sum_{k=-\infty}^{\infty} x[k]\,y[n - k] \qquad -\infty < n < \infty.$$

As an example, if $x[n] = u[n]$ and $y[n] = u[n]$, then $z[n] = (n+1)u[n]$. The operation
of convolving two signals together is called convolution and is implemented using a
convolution sum. It is denoted by $x[n] \star y[n]$. The operation is commutative in that
$x[n] \star y[n] = y[n] \star x[n]$ so that an equivalent form is

$$z[n] = \sum_{k=-\infty}^{\infty} y[k]\,x[n - k] \qquad -\infty < n < \infty.$$

As an example, if $y[n] = \delta[n - n_0]$, then it is easily shown that
$x[n] \star \delta[n - n_0] = \delta[n - n_0] \star x[n] = x[n - n_0]$. The most important property of
convolution is that two signals that are convolved together produce a signal whose
Fourier transform is the product of the signals’ Fourier transforms or

$$\mathcal{F}\{x[n] \star y[n]\} = X(f)Y(f).$$
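The two convolution examples above can be reproduced with a short sketch of the convolution sum for finite-length causal sequences. The function name `convolve` and the truncation length `L` are assumptions; the unit step must be truncated, so only the first `L` output samples are free of truncation effects.

```python
def convolve(x, y):
    """Convolution sum of two finite sequences stored as lists starting at n = 0."""
    z = [0] * (len(x) + len(y) - 1)
    for k, xk in enumerate(x):
        for m, ym in enumerate(y):
            z[k + m] += xk * ym      # the term x[k] y[n - k] with n = k + m
    return z


L = 6                       # truncation length (an assumption)
u = [1] * L                 # u[n] for n = 0, 1, ..., L - 1
z = convolve(u, u)
first = z[:L]               # samples unaffected by truncating u[n]

x = [3, 1, 4]
delayed = convolve(x, [0, 0, 1])    # convolution with delta[n - 2]
print(first, delayed)
```

The first `L` samples follow $z[n] = n + 1$, and convolving with $\delta[n - 2]$ simply delays `x` by two samples.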

Two signals $x[n]$ and $y[n]$ are said to be correlated together to yield a new signal
$z[n]$ if

$$z[n] = \sum_{k=-\infty}^{\infty} x[k]\,y[k + n] \qquad -\infty < n < \infty.$$

The Fourier transform of $z[n]$ is $X^*(f)Y(f)$. The sequence $z[n]$ is also called the
deterministic cross-correlation. If $x[n] = y[n]$, then $z[n]$ is called the deterministic
autocorrelation and its Fourier transform is $|X(f)|^2$.

The discrete-time signal may be recovered from its Fourier transform by using
the discrete-time inverse Fourier transform

$$x[n] = \int_{-1/2}^{1/2} X(f)\exp(j2\pi f n)\,df \qquad -\infty < n < \infty. \tag{D.3}$$

As an example, if $X(f) = \frac{1}{2}\delta(f + f_0) + \frac{1}{2}\delta(f - f_0)$, then the integral yields
$x[n] = \cos(2\pi f_0 n)$. It also has the interpretation that a discrete-time signal $x[n]$
may be thought of as a sum of complex sinusoids $X(f)\exp(j2\pi f n)\Delta f$ for
$-1/2 \leq f \leq 1/2$ with amplitude $|X(f)|\Delta f$ and phase $\angle X(f)$. There is a separate
sinusoid for each frequency $f$, and the total number of sinusoids is uncountable.
D.2.2 Numerical Evaluation of Discrete-Time Fourier Transforms

The discrete-time Fourier transform of a signal $x[n]$, which is nonzero only for
$n = 0, 1, \ldots, N-1$, is given by

$$X(f) = \sum_{n=0}^{N-1} x[n]\exp(-j2\pi f n) \qquad -1/2 \leq f \leq 1/2. \tag{D.4}$$

Such a signal is said to be time-limited. Since the Fourier transform is periodic
with period one, we can equivalently evaluate it over the interval $0 \leq f \leq 1$. Then,
if we desire the Fourier transform for $-1/2 \leq f' < 0$, we use the previously
evaluated $X(f)$ with $f = f' + 1$. To numerically evaluate the Fourier transform we
therefore can use the frequency interval $[0, 1]$ and compute samples of $X(f)$ for
$f = 0, 1/N, 2/N, \ldots, (N-1)/N$. This yields the discrete Fourier transform (DFT),
which is defined as

$$X[k] = X(f)\big|_{f=k/N} = \sum_{n=0}^{N-1} x[n]\exp(-j2\pi(k/N)n) \qquad k = 0, 1, \ldots, N-1.$$

Since there are only $N$ time samples, we may wish to compute more frequency
samples since $X(f)$ is a continuous function of frequency. To do so we can pad
the time samples with zeros to yield a new signal $x'[n]$ of length $M > N$ with
samples $\{x[0], x[1], \ldots, x[N-1], 0, 0, \ldots, 0\}$. This new signal $x'[n]$ will consist of
$N$ time samples and $M - N$ zeros so that the DFT will compute more finely spaced
frequency samples as

$$\begin{aligned}
X[k] = X(f)\big|_{f=k/M} &= \sum_{n=0}^{M-1} x'[n]\exp(-j2\pi(k/M)n) \qquad k = 0, 1, \ldots, M-1 \\
&= \sum_{n=0}^{N-1} x[n]\exp(-j2\pi(k/M)n) \qquad k = 0, 1, \ldots, M-1.
\end{aligned}$$

The actual DFT is computed using the fast Fourier transform (FFT), which is an
algorithm used to reduce the computation.
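The effect of zero padding can be seen in a direct (non-FFT) evaluation of the DFT. This is a minimal sketch under stated assumptions: the function names `dtft` and `dft`, the four-sample signal, and the padded length `M = 16` are all chosen for illustration, and the direct sum is used instead of an FFT only to keep the example self-contained.

```python
import cmath


def dtft(x, f):
    """Evaluate (D.4) at a single frequency f for a time-limited signal x."""
    return sum(xn * cmath.exp(-2j * cmath.pi * f * n) for n, xn in enumerate(x))


def dft(x):
    """Direct DFT: samples of the DTFT at f = k / len(x)."""
    n_pts = len(x)
    return [sum(xn * cmath.exp(-2j * cmath.pi * (k / n_pts) * n)
                for n, xn in enumerate(x))
            for k in range(n_pts)]


x = [1.0, 0.5, 0.25, 0.125]                 # N = 4 time samples (assumed example)
M = 16                                      # padded length (assumed)
x_padded = x + [0.0] * (M - len(x))         # N samples followed by M - N zeros

X_coarse = dft(x)                           # spacing 1/N in frequency
X_fine = dft(x_padded)                      # spacing 1/M in frequency
print(abs(X_fine[0]), abs(X_coarse[0]))
```

Padding adds no new information about the signal; it only evaluates the same $X(f)$ on a finer frequency grid, so every fourth fine sample coincides with a coarse one here.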

The inverse Fourier transform of an infinite length causal sequence can be
approximated using an inverse DFT as

$$\begin{aligned}
x[n] &= \int_{-1/2}^{1/2} X(f)\exp(j2\pi f n)\,df = \int_{0}^{1} X(f)\exp(j2\pi f n)\,df \\
&\approx \frac{1}{M}\sum_{k=0}^{M-1} X[k]\exp(j2\pi(k/M)n) \qquad n = 0, 1, \ldots, M-1.
\end{aligned} \tag{D.5}$$

One should choose $M$ large. The actual inverse DFT is computed using the inverse
FFT.
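To illustrate the approximation (D.5), the sketch below samples the known transform $X(f) = 1/(1 - (1/2)\exp(-j2\pi f))$ at $M$ frequencies and applies a direct inverse DFT. The value `M = 64` and the function names are assumptions; a direct sum stands in for the inverse FFT.

```python
import cmath


def X(f):
    """DTFT of x[n] = (1/2)^n u[n]."""
    return 1 / (1 - 0.5 * cmath.exp(-2j * cmath.pi * f))


def inverse_dft(samples, n):
    """Right-hand side of (D.5) at time index n."""
    m = len(samples)
    total = sum(samples[k] * cmath.exp(2j * cmath.pi * (k / m) * n)
                for k in range(m))
    return total / m


M = 64                                        # number of frequency samples (assumed)
samples = [X(k / M) for k in range(M)]        # X[k] = X(f) at f = k / M
x_hat = [inverse_dft(samples, n).real for n in range(8)]
x_true = [0.5 ** n for n in range(8)]
print(x_hat)
```

The approximation error comes from time-domain aliasing: sampling $X(f)$ at $M$ points returns the sum of $x[n]$ and its copies shifted by multiples of $M$, which is why $M$ should be large enough that $x[n]$ has essentially decayed by $n = M$.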
D.2.3 z-Transforms

The z-transform of a discrete-time signal $x[n]$ is defined as

$$\mathcal{X}(z) = \sum_{n=-\infty}^{\infty} x[n] z^{-n} \tag{D.6}$$

where $z$ is a complex variable that takes on values for which $|\mathcal{X}(z)| < \infty$. As an
example, if $x[n] = (1/2)^n u[n]$, then

$$\mathcal{X}(z) = \frac{1}{1 - \frac{1}{2}z^{-1}} \qquad |z| > \frac{1}{2}. \tag{D.7}$$

The operation of taking the z-transform is indicated by $\mathcal{Z}\{x[n]\}$. Some important
properties of the z-transform are:

a. Linearity - $\mathcal{Z}\{ax[n] + by[n]\} = a\mathcal{X}(z) + b\mathcal{Y}(z)$

b. Time shift - $\mathcal{Z}\{x[n - n_0]\} = z^{-n_0}\mathcal{X}(z)$

c. Convolution - $\mathcal{Z}\{x[n] \star y[n]\} = \mathcal{X}(z)\mathcal{Y}(z)$

Assuming that the z-transform converges on the unit circle, the discrete-time Fourier
transform is given by

$$X(f) = \mathcal{X}(z)\big|_{z=\exp(j2\pi f)} \tag{D.8}$$

as is seen by comparing (D.6) to (D.2). As an example, if $x[n] = (1/2)^n u[n]$, then
from (D.7)

$$X(f) = \frac{1}{1 - \frac{1}{2}\exp(-j2\pi f)}$$

since $\mathcal{X}(z)$ converges for $|z| = |\exp(j2\pi f)| = 1 > 1/2$.
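The closed form (D.7) used above follows from summing a geometric series. The short derivation below is added as a worked check and is not taken from the text; it also shows where the convergence condition $|z| > 1/2$ comes from.

```latex
\begin{align*}
\mathcal{X}(z) &= \sum_{n=-\infty}^{\infty} \left(\tfrac{1}{2}\right)^{n} u[n]\, z^{-n}
              = \sum_{n=0}^{\infty} \left(\tfrac{1}{2} z^{-1}\right)^{n} \\
             &= \frac{1}{1 - \tfrac{1}{2} z^{-1}}
              \qquad \text{if } \left|\tfrac{1}{2} z^{-1}\right| < 1,
              \text{ i.e., } |z| > \tfrac{1}{2}.
\end{align*}
```

Since the unit circle $|z| = 1$ lies inside this region of convergence, the substitution $z = \exp(j2\pi f)$ in (D.8) is permitted for this signal.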


D.3  Discrete-Time  Linear  Systems 

A discrete-time system takes an input signal $x[n]$ and produces an output signal $y[n]$.
The transformation is symbolically represented as $y[n] = \mathcal{L}\{x[n]\}$. The system is
linear if $\mathcal{L}\{ax[n] + by[n]\} = a\mathcal{L}\{x[n]\} + b\mathcal{L}\{y[n]\}$. A system is defined to be shift
invariant if $\mathcal{L}\{x[n - n_0]\} = y[n - n_0]$. If the system is linear and shift invariant
(LSI), then the output is easily found if we know the output to a unit impulse. To
see this we compute the output of the system as

$$\begin{aligned}
y[n] &= \mathcal{L}\{x[n]\} \\
&= \mathcal{L}\left\{\sum_{k=-\infty}^{\infty} x[k]\,\delta[n - k]\right\} && \text{(using (D.1))} \\
&= \sum_{k=-\infty}^{\infty} x[k]\,\mathcal{L}\{\delta[n - k]\} && \text{(linearity)} \\
&= \sum_{k=-\infty}^{\infty} x[k]\,\mathcal{L}\{\delta[n]\}\big|_{n \to n-k} && \text{(shift invariance)} \\
&= \sum_{k=-\infty}^{\infty} x[k]\,h[n - k]
\end{aligned}$$

where $h[n] = \mathcal{L}\{\delta[n]\}$ is called the impulse response of the system. Note that
$y[n] = x[n] \star h[n] = h[n] \star x[n]$ and so the output of the LSI system is also given by
the convolution sum

$$y[n] = \sum_{k=-\infty}^{\infty} h[k]\,x[n - k]. \tag{D.9}$$

k=—oo 

A causal system is defined as one for which $h[k] = 0$ for $k < 0$ since then the output
depends only on the present input $x[n]$ and the past inputs $x[n - k]$ for $k \geq 1$. The
system is said to be stable if

$$\sum_{k=-\infty}^{\infty} |h[k]| < \infty.$$

If this condition is satisfied, then a bounded input signal or $|x[n]| < \infty$ for
$-\infty < n < \infty$ will always produce a bounded output signal or $|y[n]| < \infty$ for
$-\infty < n < \infty$. As an example, the LSI system with impulse response
$h[k] = (1/2)^k u[k]$ is stable but not the one with impulse response $h[k] = u[k]$. The
latter system will produce the unbounded output $y[n] = (n + 1)u[n]$ for the bounded
input $x[n] = u[n]$ since $u[n] \star u[n] = (n + 1)u[n]$.

Since for an LSI system $y[n] = h[n] \star x[n]$, it follows from the properties of
z-transforms that $\mathcal{Y}(z) = \mathcal{H}(z)\mathcal{X}(z)$, where $\mathcal{H}(z)$ is the z-transform of the
impulse response. As a result, we have that

$$\mathcal{H}(z) = \frac{\mathcal{Y}(z)}{\mathcal{X}(z)} = \frac{\text{Output z-transform}}{\text{Input z-transform}} \tag{D.10}$$

and $\mathcal{H}(z)$ is called the system function. Note that since it is the z-transform of the
impulse response $h[n]$ we have

$$\mathcal{H}(z) = \sum_{n=-\infty}^{\infty} h[n] z^{-n}. \tag{D.11}$$
If the input to an LSI system is a complex sinusoid, $x[n] = \exp(j2\pi f_0 n)$, then
the output is from (D.9)

$$\begin{aligned}
y[n] &= \sum_{k=-\infty}^{\infty} h[k]\exp[j2\pi f_0(n - k)] \\
&= \underbrace{\sum_{k=-\infty}^{\infty} h[k]\exp(-j2\pi f_0 k)}_{H(f_0)}\,\exp(j2\pi f_0 n).
\end{aligned} \tag{D.12}$$

It is seen that the output is also a complex sinusoid with the same frequency but
multiplied by the Fourier transform of the impulse response evaluated at the
sinusoidal frequency. Hence, $H(f)$ is called the frequency response. Also, from (D.12)
the frequency response is obtained from the system function (see (D.11)) by
letting $z = \exp(j2\pi f)$. Finally, note that the frequency response is the discrete-time
Fourier transform of the impulse response. As an example, if $h[n] = (1/2)^n u[n]$,
then

$$\mathcal{H}(z) = \frac{1}{1 - \frac{1}{2}z^{-1}}$$

and

$$H(f) = \mathcal{H}(\exp(j2\pi f)) = \frac{1}{1 - \frac{1}{2}\exp(-j2\pi f)}.$$

The magnitude response of the LSI system is defined as $|H(f)|$ and the phase
response as $\angle H(f)$.
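The statement in (D.12) can be checked numerically for the example $h[n] = (1/2)^n u[n]$. The sketch below is only an illustration: the impulse response is truncated to `K = 60` samples (an assumption that leaves an error of about $2^{-59}$), and the frequency `f0 = 0.1` is arbitrary.

```python
import cmath
import math

K = 60                                   # truncation length of h[k] (assumed)
h = [0.5 ** k for k in range(K)]         # h[k] = (1/2)^k u[k], k = 0, ..., K - 1
f0 = 0.1                                 # input frequency in cycles per sample (assumed)


def output_sample(n):
    """y[n] from the convolution sum (D.9) with x[n] = exp(j 2 pi f0 n)."""
    return sum(hk * cmath.exp(2j * math.pi * f0 * (n - k))
               for k, hk in enumerate(h))


# Closed-form frequency response at f0.
H_f0 = 1 / (1 - 0.5 * cmath.exp(-2j * math.pi * f0))
magnitude = abs(H_f0)                    # magnitude response |H(f0)|
phase = cmath.phase(H_f0)                # phase response in radians

# The output should equal H(f0) times the input sinusoid at every n.
errors = [abs(output_sample(n) - H_f0 * cmath.exp(2j * math.pi * f0 * n))
          for n in range(-3, 4)]
print(magnitude, phase, max(errors))
```

The output differs from the input only by the complex scale factor $H(f_0)$, that is, by a gain $|H(f_0)|$ and a phase shift $\angle H(f_0)$.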

As we have seen, LSI systems can be characterized by the equivalent descriptions:
impulse response, system function, or frequency response. This means that given
one of these descriptions the output can be determined for any input. LSI systems
can also be characterized by linear difference equations with constant coefficients.
Some examples are

$$\begin{aligned}
y_1[n] &= x[n] - b\,x[n-1] \\
y_2[n] &= a\,y_2[n-1] + x[n] \\
y_3[n] &= a\,y_3[n-1] + x[n] - b\,x[n-1]
\end{aligned}$$

and more generally

$$y[n] = \sum_{k=1}^{p} a[k]\,y[n-k] + x[n] - \sum_{k=1}^{q} b[k]\,x[n-k]. \tag{D.13}$$


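The system function is found by taking the z-transform of both sides of the difference
equations and using (D.10) to yield

$$\begin{aligned}
\mathcal{Y}_1(z) = \mathcal{X}(z) - bz^{-1}\mathcal{X}(z) \quad &\Rightarrow \quad \mathcal{H}_1(z) = 1 - bz^{-1} \\
\mathcal{Y}_2(z) = az^{-1}\mathcal{Y}_2(z) + \mathcal{X}(z) \quad &\Rightarrow \quad \mathcal{H}_2(z) = \frac{1}{1 - az^{-1}} \\
\mathcal{Y}_3(z) = az^{-1}\mathcal{Y}_3(z) + \mathcal{X}(z) - bz^{-1}\mathcal{X}(z) \quad &\Rightarrow \quad \mathcal{H}_3(z) = \frac{1 - bz^{-1}}{1 - az^{-1}}
\end{aligned}$$

and the frequency response is obtained using $H(f) = \mathcal{H}(\exp(j2\pi f))$. More generally,
for the LSI system whose difference equation description is given by (D.13) we have

$$\mathcal{H}(z) = \frac{1 - \sum_{k=1}^{q} b[k]z^{-k}}{1 - \sum_{k=1}^{p} a[k]z^{-k}}. \tag{D.14}$$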


The impulse response is obtained by taking the inverse z-transform of the system
function to yield for the previous examples

$$h_1[n] = \begin{cases} 1 & n = 0 \\ -b & n = 1 \\ 0 & \text{otherwise} \end{cases}$$

$$h_2[n] = a^n u[n] \qquad \text{(assuming system is causal)}$$

$$h_3[n] = a^n u[n] - b\,a^{n-1}u[n-1] \qquad \text{(assuming system is causal).}$$

The impulse response could also be obtained by letting $x[n] = \delta[n]$ in the difference
equations and setting $y[-1] = 0$, due to causality, and recursing the difference
equation. For example, if the difference equation is $y[n] = (1/2)y[n-1] + x[n]$, then
by definition the impulse response satisfies the equation $h[n] = (1/2)h[n-1] + \delta[n]$.
By recursing this we obtain

$$\begin{aligned}
h[0] &= \tfrac{1}{2}h[-1] + \delta[0] = 1 && \text{(since } h[-1] = 0 \text{ due to causality)} \\
h[1] &= \tfrac{1}{2}h[0] + \delta[1] = \tfrac{1}{2} && \text{(since } \delta[n] = 0 \text{ for } n \geq 1\text{)} \\
h[2] &= \tfrac{1}{2}h[1] + \delta[2] = \tfrac{1}{4} \\
&\;\;\vdots
\end{aligned}$$

and so in general we have the impulse response $h[n] = (1/2)^n u[n]$. The system with
impulse response $h_1[n]$ is called a finite impulse response (FIR) system while those of
$h_2[n]$ and $h_3[n]$ are called infinite impulse response (IIR) systems. The terminology
refers to the number of nonzero samples of the impulse response.
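The recursion just described is easy to mechanize for the general difference equation (D.13). The following is a minimal sketch, assuming a causal system at rest; the function name `impulse_response` and the coefficient values in the examples are illustrative assumptions.

```python
def delta(n):
    """Unit impulse."""
    return 1.0 if n == 0 else 0.0


def impulse_response(a, b, n_samples):
    """Recurse (D.13) with x[n] = delta[n] and h[n] = 0 for n < 0.

    a = [a[1], ..., a[p]] and b = [b[1], ..., b[q]] are the coefficients of (D.13).
    """
    h = []
    for n in range(n_samples):
        y = delta(n)
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:               # h[n - k] = 0 for n - k < 0 (causality)
                y += ak * h[n - k]
        for k, bk in enumerate(b, start=1):
            y -= bk * delta(n - k)
        h.append(y)
    return h


h1 = impulse_response([], [0.25], 5)       # FIR: y[n] = x[n] - b x[n-1], b = 1/4
h2 = impulse_response([0.5], [], 5)        # IIR: y[n] = a y[n-1] + x[n], a = 1/2
h3 = impulse_response([0.5], [0.25], 5)    # IIR with a = 1/2 and b = 1/4
print(h1, h2, h3)
```

Only `h1` has finitely many nonzero samples; `h2` and `h3` decay geometrically but never become exactly zero, which is the FIR and IIR distinction made above.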

For the system function $\mathcal{H}_3(z) = (1 - bz^{-1})/(1 - az^{-1})$, the value of $z$ for which
the numerator is zero is called a zero and the value of $z$ for which the denominator is
zero is called a pole. In this case the system function has one zero at $z = b$ and one
pole at $z = a$. For the system to be stable, assuming it is causal, all the poles of the
system function must be within the unit circle of the z-plane. Hence, for stability
we require $|a| < 1$. The zeros may lie anywhere in the z-plane. For a second-order
system function (let $p = 2$ and $q = 0$ in (D.14)) given as

$$\mathcal{H}(z) = \frac{1}{1 - a[1]z^{-1} - a[2]z^{-2}}$$

the poles, assuming they are complex, are located at $z = r\exp(\pm j\theta)$. Hence, for
stability we require $r < 1$ and we note that since the poles are the $z$ values for which
the denominator polynomial is zero, we have

$$1 - a[1]z^{-1} - a[2]z^{-2} = z^{-2}(z - r\exp(j\theta))(z - r\exp(-j\theta)).$$

Therefore, the coefficients are related to the complex poles as

$$\begin{aligned}
a[1] &= 2r\cos(\theta) \\
a[2] &= -r^2
\end{aligned}$$

which puts restrictions on the possible values of $a[1]$ and $a[2]$. As an example, the
coefficients $a[1] = 0$, $a[2] = -1/4$ produce a stable filter but not $a[1] = 0$, $a[2] = -2$.
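The stability condition for the second-order system can be checked by computing the poles directly as the roots of $z^2 - a[1]z - a[2] = 0$. This is a sketch under the assumption of a causal system; the function names are hypothetical.

```python
import cmath
import math


def poles_second_order(a1, a2):
    """Roots of z^2 - a1 z - a2 = 0, the poles of 1 / (1 - a1 z^-1 - a2 z^-2)."""
    disc = cmath.sqrt(a1 * a1 + 4 * a2)
    return ((a1 + disc) / 2, (a1 - disc) / 2)


def is_stable(a1, a2):
    """A causal system is stable if all poles lie inside the unit circle."""
    return all(abs(p) < 1 for p in poles_second_order(a1, a2))


stable_case = is_stable(0.0, -0.25)      # poles at +/- j/2, so r = 1/2
unstable_case = is_stable(0.0, -2.0)     # poles at +/- j sqrt(2), so r > 1

# Recover r and theta from a complex pole and rebuild the coefficients.
p = poles_second_order(1.0, -0.5)[0]     # assumed example with complex poles
r, theta = abs(p), cmath.phase(p)
a1_rebuilt = 2 * r * math.cos(theta)
a2_rebuilt = -r ** 2
print(stable_case, unstable_case, a1_rebuilt, a2_rebuilt)
```

The last lines confirm the relations $a[1] = 2r\cos(\theta)$ and $a[2] = -r^2$ for one pair of complex poles.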
An LSI system whose frequency response is

$$H(f) = \begin{cases} 1 & |f| \leq B \\ 0 & |f| > B \end{cases}$$

is said to be an ideal lowpass filter. It passes complex sinusoids undistorted if their
frequency is $|f| \leq B$ but nullifies ones with a higher frequency. The band of positive
frequencies from $f = 0$ to $f = B$ is called the passband and the band of positive
frequencies for which $f > B$ is called the stopband.


D.4  Continuous-Time  Signals 

A continuous-time signal is a function of time $x(t)$ for $-\infty < t < \infty$. Some
important signals are:

a. Unit impulse - It is denoted by $\delta(t)$. An impulse $\delta(t)$, also called the Dirac delta
   function, is defined as the limit of a very narrow pulse as the pulsewidth goes
   to zero and the pulse amplitude goes to infinity, such that the overall area
   remains at one. Therefore, if we define a very narrow pulse as

$$x_T(t) = \begin{cases} \frac{1}{T} & |t| \leq T/2 \\ 0 & |t| > T/2 \end{cases}$$

   then the unit impulse is defined as

$$\delta(t) = \lim_{T \to 0} x_T(t).$$

   The impulse has the important sifting property that if $x(t)$ is continuous at
   $t = t_0$, then

$$\int_{-\infty}^{\infty} x(t)\,\delta(t - t_0)\,dt = x(t_0).$$

b. Unit step - $x(t) = 1$ for $t \geq 0$ and $x(t) = 0$ for $t < 0$. It is also denoted by $u(t)$.

c. Real sinusoid - $x(t) = A\cos(2\pi F_0 t + \theta)$ for $-\infty < t < \infty$, where $A$ is the
   amplitude (must be nonnegative), $F_0$ is the frequency in Hz (cycles per second),
   and $\theta$ is the phase in radians.

d. Complex sinusoid - $x(t) = A\exp(j(2\pi F_0 t + \theta))$ for $-\infty < t < \infty$, with the
   amplitude, frequency, and phase taking on the same values as for the real sinusoid.

e. Exponential - $x(t) = \exp(at)u(t)$

f. Pulse - $x(t) = 1$ for $|t| \leq T/2$ and $x(t) = 0$ for $|t| > T/2$.

Some  special  signals  are  defined  next. 

a. A signal is causal if $x(t) = 0$ for $t < 0$, for example, $x(t) = u(t)$.

b. A signal is anticausal if $x(t) = 0$ for $t > 0$, for example, $x(t) = u(-t)$.

c. A signal is even if $x(-t) = x(t)$ or it is symmetric about $t = 0$, for example,
   $x(t) = \cos(2\pi F_0 t)$.

d. A signal is odd if $x(-t) = -x(t)$ or it is antisymmetric about $t = 0$, for example,
   $x(t) = \sin(2\pi F_0 t)$.

e. A signal is stable if $\int_{-\infty}^{\infty} |x(t)|\,dt < \infty$ (also called absolutely integrable),
   for example, $x(t) = \exp(-t)u(t)$.


D.5  Linear  Transforms 

D.5.1  Continuous-Time  Fourier  Transforms 

The continuous-time Fourier transform $X(F)$ of a continuous-time signal $x(t)$ is
defined as

$$X(F) = \int_{-\infty}^{\infty} x(t)\exp(-j2\pi F t)\,dt \qquad -\infty < F < \infty. \tag{D.15}$$

An example is $x(t) = \exp(-t)u(t)$ for which $X(F) = 1/(1 + j2\pi F)$. It converts a
continuous-time signal into a complex function of $F$, where $F$ is called the frequency
and is measured in Hz (cycles per second). The operation of taking the Fourier
transform of a signal is denoted by $\mathcal{F}\{x(t)\}$ and the signal and its Fourier
transform are referred to as a Fourier transform pair. The latter relationship is usually
denoted by $x(t) \Leftrightarrow X(F)$. Note that the magnitude of $X(F)$ is an even function
or $|X(-F)| = |X(F)|$ and the phase is an odd function or $\phi(-F) = -\phi(F)$. Some
Fourier transform pairs are given in Table D.2.

Signal name         x(t)                                             X(F)

Unit impulse        $\delta(t)$                                      $1$
Real sinusoid       $\cos(2\pi F_0 t)$                               $\frac{1}{2}\delta(F + F_0) + \frac{1}{2}\delta(F - F_0)$
Complex sinusoid    $\exp(j2\pi F_0 t)$                              $\delta(F - F_0)$
Exponential         $\exp(-at)u(t)$                                  $\frac{1}{a + j2\pi F}$, $a > 0$
Pulse               1 for $|t| \leq T/2$, 0 for $|t| > T/2$          $T\,\frac{\sin(\pi F T)}{\pi F T}$

Table D.2: Continuous-time Fourier transform pairs.

Some  important  properties  of  the  continuous-time  Fourier  transform  are: 

a. Linearity - $\mathcal{F}\{ax(t) + by(t)\} = aX(F) + bY(F)$

b. Time shift - $\mathcal{F}\{x(t - t_0)\} = \exp(-j2\pi F t_0)X(F)$

c. Modulation - $\mathcal{F}\{\cos(2\pi F_0 t)x(t)\} = \frac{1}{2}X(F + F_0) + \frac{1}{2}X(F - F_0)$

d. Time reversal - $\mathcal{F}\{x(-t)\} = X^*(F)$

e. Symmetry - if $x(t)$ is even, then $X(F)$ is even and real, and if $x(t)$ is odd, then $X(F)$ is odd and purely imaginary.

f. Energy - the energy defined as $\int_{-\infty}^{\infty} x^2(t)\,dt$ can be found from the Fourier transform using Parseval's theorem

$$\int_{-\infty}^{\infty} x^2(t)\, dt = \int_{-\infty}^{\infty} |X(F)|^2\, dF.$$

g. Inner product - as an extension of Parseval's theorem we have

$$\int_{-\infty}^{\infty} x(t)y(t)\, dt = \int_{-\infty}^{\infty} X^*(F)Y(F)\, dF.$$
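As a quick check of Parseval's theorem, consider the exponential signal $x(t) = \exp(-t)u(t)$ used as an example earlier, whose transform is $X(F) = 1/(1 + j2\pi F)$. Both sides can be evaluated in closed form; the choice of this particular signal is only for illustration.

```latex
\begin{align*}
\int_{-\infty}^{\infty} x^2(t)\,dt &= \int_{0}^{\infty} \exp(-2t)\,dt = \frac{1}{2}, \\
\int_{-\infty}^{\infty} |X(F)|^2\,dF &= \int_{-\infty}^{\infty} \frac{dF}{1 + 4\pi^2 F^2}
  = \frac{1}{2\pi}\Big[\arctan(2\pi F)\Big]_{-\infty}^{\infty}
  = \frac{1}{2\pi}\cdot\pi = \frac{1}{2}.
\end{align*}
```

The time-domain and frequency-domain energies agree, as the theorem requires.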

Two signals $x(t)$ and $y(t)$ are said to be convolved together to yield a new signal $z(t)$ if

$$z(t) = \int_{-\infty}^{\infty} x(\tau)y(t - \tau)\, d\tau \qquad -\infty < t < \infty.$$

As an example, if $x(t) = u(t)$ and $y(t) = u(t)$, then $z(t) = tu(t)$. The operation of convolving two signals together is called convolution and is implemented using a convolution integral. It is denoted by $x(t) * y(t)$. The operation is commutative in that $x(t) * y(t) = y(t) * x(t)$, so that an equivalent form is

$$z(t) = \int_{-\infty}^{\infty} y(\tau)x(t - \tau)\, d\tau \qquad -\infty < t < \infty.$$

As an example, if $y(t) = \delta(t - t_0)$, then it is easily shown that $x(t) * \delta(t - t_0) = \delta(t - t_0) * x(t) = x(t - t_0)$. The most important property of convolution is that two signals that are convolved together produce a signal whose Fourier transform is the product of the signals' Fourier transforms, or

$$\mathcal{F}\{x(t) * y(t)\} = X(F)Y(F).$$
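The unit-step example can be checked numerically. The sketch below approximates the convolution integral at a single time instant with a midpoint Riemann sum; the helper name `convolve_at`, the integration limits, and the step size are assumptions made for illustration only.

```python
def u(t):
    """Unit step u(t)."""
    return 1.0 if t >= 0 else 0.0

def convolve_at(x, y, t, tau_min, tau_max, dtau=1e-3):
    """Approximate z(t) = integral of x(tau) y(t - tau) d tau over [tau_min, tau_max]."""
    n = int(round((tau_max - tau_min) / dtau))
    total = 0.0
    for k in range(n):
        tau = tau_min + (k + 0.5) * dtau  # midpoint of the k-th subinterval
        total += x(tau) * y(t - tau) * dtau
    return total

# For x(t) = y(t) = u(t) the exact result is z(t) = t u(t).
z_at_2 = convolve_at(u, u, 2.0, -5.0, 5.0)       # expected to be close to 2
z_at_minus_1 = convolve_at(u, u, -1.0, -5.0, 5.0)  # expected to be 0
print(z_at_2, z_at_minus_1)
```

Because the integrand is nonzero only for $0 \le \tau \le t$, the finite limits $\pm 5$ are sufficient for the two time instants evaluated here.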

The continuous-time signal may be recovered from its Fourier transform by using the continuous-time inverse Fourier transform

$$x(t) = \int_{-\infty}^{\infty} X(F) \exp(j2\pi F t)\, dF \qquad -\infty < t < \infty. \qquad \text{(D.16)}$$

As an example, if $X(F) = \frac{1}{2}\delta(F + F_0) + \frac{1}{2}\delta(F - F_0)$, then the integral yields $x(t) = \cos(2\pi F_0 t)$. The inverse transform also has the interpretation that a continuous-time signal $x(t)$ may be thought of as a sum of complex sinusoids $X(F)\exp(j2\pi F t)\Delta F$ for $-\infty < F < \infty$ with amplitude $|X(F)|\Delta F$ and phase $\angle X(F)$. There is a separate sinusoid for each frequency $F$, and the total number of sinusoids is uncountable.
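The cosine example follows directly from the sifting property of the Dirac delta function. A short worked version of that step, written out under the same notation as (D.16), is:

```latex
\begin{align*}
x(t) &= \int_{-\infty}^{\infty} \left[\tfrac{1}{2}\delta(F + F_0) + \tfrac{1}{2}\delta(F - F_0)\right] \exp(j2\pi F t)\,dF \\
     &= \tfrac{1}{2}\exp(-j2\pi F_0 t) + \tfrac{1}{2}\exp(j2\pi F_0 t) \\
     &= \cos(2\pi F_0 t).
\end{align*}
```

Each impulse picks out the value of the complex exponential at its own frequency, and the two conjugate exponentials combine into the real cosine.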

D.6  Continuous-Time  Linear  Systems 

A continuous-time system takes an input signal $x(t)$ and produces an output signal $y(t)$. The transformation is symbolically represented as $y(t) = \mathcal{L}\{x(t)\}$. The system is linear if $\mathcal{L}\{ax(t) + by(t)\} = a\mathcal{L}\{x(t)\} + b\mathcal{L}\{y(t)\}$. A system is defined to be time invariant if $\mathcal{L}\{x(t - t_0)\} = y(t - t_0)$. If the system is linear and time invariant (LTI), then the output is easily found if we know the output to a unit impulse. It is given by the convolution integral

$$y(t) = \int_{-\infty}^{\infty} h(\tau)x(t - \tau)\, d\tau \qquad \text{(D.17)}$$

where $h(t) = \mathcal{L}\{\delta(t)\}$ is called the impulse response of the system. A causal system is defined as one for which $h(\tau) = 0$ for $\tau < 0$ since then the output depends only on the present input $x(t)$ and the past inputs $x(t - \tau)$ for $\tau > 0$. The system is said to be stable if

$$\int_{-\infty}^{\infty} |h(\tau)|\, d\tau < \infty.$$

If this condition is satisfied, then a bounded input signal, or $|x(t)| < \infty$ for $-\infty < t < \infty$, will always produce a bounded output signal, or $|y(t)| < \infty$ for $-\infty < t < \infty$. As an example, the LTI system with impulse response $h(\tau) = \exp(-\tau)u(\tau)$ is stable but not the one with impulse response $h(\tau) = u(\tau)$. The latter system will produce the unbounded output $y(t) = tu(t)$ for the bounded input $x(t) = u(t)$ since $u(t) * u(t) = tu(t)$.
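The contrast between the two impulse responses can be seen by integrating $|h(\tau)|$ over a growing interval. The following is a minimal sketch, assuming a midpoint Riemann sum and hypothetical helper names (`abs_integral`, `h_exp`, `h_step`); both impulse responses are causal, so the integral starts at zero.

```python
import math

def abs_integral(h, upper, dt=1e-3):
    """Midpoint-rule approximation of the integral of |h(t)| over [0, upper]."""
    n = int(round(upper / dt))
    return sum(abs(h((k + 0.5) * dt)) * dt for k in range(n))

def h_exp(t):
    """h(t) = exp(-t) u(t): absolutely integrable."""
    return math.exp(-t) if t >= 0 else 0.0

def h_step(t):
    """h(t) = u(t): not absolutely integrable."""
    return 1.0 if t >= 0 else 0.0

for upper in (10.0, 20.0, 40.0):
    # First column settles near 1; second column keeps growing with the upper limit.
    print(upper, abs_integral(h_exp, upper), abs_integral(h_step, upper))
```

The integral for the exponential settles near one as the upper limit grows, while the integral for the unit step grows without bound, consistent with the stability condition.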

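If the input to an LTI system is a complex sinusoid, $x(t) = \exp(j2\pi F_0 t)$, then the output is from (D.17)

$$\begin{aligned}
y(t) &= \int_{-\infty}^{\infty} h(\tau) \exp[j2\pi F_0(t - \tau)]\, d\tau \\
     &= \underbrace{\int_{-\infty}^{\infty} h(\tau) \exp(-j2\pi F_0 \tau)\, d\tau}_{H(F_0)}\, \exp(j2\pi F_0 t). \qquad \text{(D.18)}
\end{aligned}$$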


It is seen that the output is also a complex sinusoid with the same frequency but multiplied by the Fourier transform of the impulse response evaluated at the sinusoidal frequency. Hence, $H(F)$ is called the frequency response. Finally, note that the frequency response is the continuous-time Fourier transform of the impulse response. As an example, if $h(t) = \exp(-at)u(t)$, then for $a > 0$

$$H(F) = \frac{1}{a + j2\pi F}.$$

The magnitude response of the LTI system is defined as $|H(F)|$ and the phase response as $\angle H(F)$.
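To make the magnitude and phase responses concrete, the sketch below evaluates $H(F) = 1/(a + j2\pi F)$ at the frequency $F = a/(2\pi)$, where the real and imaginary parts of the denominator are equal. The function names and the value $a = 2$ are assumptions chosen for illustration.

```python
import cmath
import math

def frequency_response(F, a):
    """H(F) = 1 / (a + j 2 pi F) for h(t) = exp(-a t) u(t), a > 0."""
    if a <= 0:
        raise ValueError("a must be positive for the transform to exist")
    return 1.0 / (a + 2j * math.pi * F)

def magnitude_and_phase(F, a):
    """Return (|H(F)|, angle of H(F) in radians)."""
    H = frequency_response(F, a)
    return abs(H), cmath.phase(H)

a = 2.0  # hypothetical decay rate
mag, phase = magnitude_and_phase(a / (2 * math.pi), a)
print(mag, phase)  # magnitude 1/(a*sqrt(2)), phase -pi/4
```

At this frequency the denominator is $a(1 + j)$, so the magnitude is $1/(a\sqrt{2})$ and the phase is $-\pi/4$; evaluating at $-F$ gives the same magnitude and the opposite phase.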

An LTI system whose frequency response is

$$H(F) = \begin{cases} 1 & |F| \le W \\ 0 & |F| > W \end{cases}$$

is said to be an ideal lowpass filter. It passes complex sinusoids undistorted if their frequency is $|F| \le W$ Hz but nullifies ones with a higher frequency. The band of positive frequencies from $F = 0$ to $F = W$ is called the passband and the band of positive frequencies for which $F > W$ is called the stopband.
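Since a complex sinusoid at frequency $F_0$ emerges from an LTI system scaled by $H(F_0)$, the action of the ideal lowpass filter reduces to a lookup of its gain. The sketch below assumes the band edge $|F| = W$ belongs to the passband and uses a hypothetical cutoff of 100 Hz.

```python
def ideal_lowpass(F, W):
    """Frequency response of the ideal lowpass filter: 1 in the passband, 0 otherwise."""
    # Assumption: the band edge |F| = W is treated as part of the passband.
    return 1.0 if abs(F) <= W else 0.0

W = 100.0  # hypothetical cutoff frequency in Hz
# Gain applied to exp(j 2 pi F0 t) for several sinusoidal frequencies F0.
gains = {F0: ideal_lowpass(F0, W) for F0 in (-150.0, -50.0, 0.0, 50.0, 150.0)}
print(gains)
```

Sinusoids at 0 and ±50 Hz pass with unit gain, while those at ±150 Hz are removed.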
Appendix E

Answers  to  Selected  Problems 


Note: For problems based on computer simulations the number of realizations used in the computer simulation will affect the numerical results. In the results listed below the number of realizations is denoted by $N_{\rm real}$. Also, each result assumes that rand('state',0) and/or randn('state',0) have been used to initialize the random number generator (see Appendix 2A for further details).

Chapter  1 

1.  experiment:  toss  a  coin;  outcomes:  {head,  tail};  probabilities:  1/2, 1/2 

5.  a.  continuous;  b.  discrete;  c.  discrete;  d.  continuous;  e.  discrete 

7.  yes,  yes 

10.  P[k  =  9]  =  0.0537,  probably  not 

13.  1/2 

14. 0.9973 for $\Delta = 0.001$

Chapter  2 

1. $P[Y = 0] = 0.7490$, $P[Y = 1] = 0.2510$ ($N_{\rm real} = 1000$)

3. via simulation: $P[-1 < X < 1] = 0.6863$; via numerical integration with $\Delta = 0.01$, $P[-1 < X < 1] = 0.6851$ ($N_{\rm real} = 10{,}000$)

6. values near zero

8. estimated mean = 0.5021; true mean = 1/2 ($N_{\rm real} = 1000$)

11. estimated mean = 1.0042; true mean = 1 ($N_{\rm real} = 1000$)

13. 1.2381 ($N_{\rm real} = 1000$)

14. no; via simulation: mean of $\sqrt{U}$ = 0.6589; via simulation: $\sqrt{\text{mean of } U}$ = 0.7125 ($N_{\rm real} = 1000$)

Chapter  3 

1. a. $A^c = \{x : x < 1\}$, $B^c = \{x : x > 2\}$

b. $A \cup B = \{x : -\infty < x < \infty\} = \mathcal{S}$, $A \cap B = \{x : 1 < x < 2\}$

c. $A - B = \{x : x > 2\}$, $B - A = \{x : x < 1\}$

7. $A = \{1, 2, 3\}$, $B = \{4, 5\}$, $C = \{1, 2, 3\}$, $D = \{4, 5, 6\}$

10. $A \cup B \cup C = (A^c \cap B^c \cap C^c)^c$, $A \cap B \cap C = (A^c \cup B^c \cup C^c)^c$

12. a. $10^7$, discrete b. 1, discrete c. $\infty$ (uncountable), continuous d. $\infty$ (uncountable), continuous e. 2, discrete f. $\infty$ (countable), discrete

14. a. $\mathcal{S} = \{t : 30 \le t \le 100\}$ b. outcomes are all $t$ in interval $[30, 100]$ c. set of outcomes having no elements, i.e., {negative temperatures} d. $A = \{t : 40 < t < 60\}$, $B = \{t : 40 < t < 50 \text{ or } 60 < t < 70\}$, $C = \{100\}$ (simple event) e. $A = \{t : 40 < t < 60\}$, $B = \{t : 60 < t < 70\}$

18. a. 1/2 b. 1/2 c. 6/36 d. 24/36

19. $P_{\rm even} = 1/2$, $\hat{P}_{\rm even} = 0.5080$ ($N_{\rm real} = 1000$)

21.  a.  even,  2/3  b.  odd,  1/3  c.  even  or  odd,  1  d.  even  and  odd,  0 

23.  1/56 

25.  10/36 

27.  no 

33.  90/216 

35.  676,000 

38.  0.00183 

40.  total  number  =  16,  two-toppings  =  6 
44. a. 4 of a kind: $13 \cdot 48 / \binom{52}{5}$

b. flush: $4\binom{13}{5} / \binom{52}{5}$
49. $P[k > 95] = 0.4407$, $\hat{P}[k > 95] = 0.4430$ ($N_{\rm real} = 1000$)

Chapter  4 

2.  1/4 

5.  1/4 

7.  a.  0.53  b.  0.34 
11.  0.5 
14.  yes 
19.  0.03 

21.  a.  no  b.  no 

22.  4 
26.  0.0439 
28.  5/16 

33. $P[k] = (k - 1)(1 - p)^{k-2}p^2$, $k = 2, 3, \ldots$

38.  2  red,  2  black,  2  white 
40.  3/64 
43.  165/512 

Chapter  5 

4. $S_X = \{0, 1, 4, 9\}$

- 

Px[xi]  =  < 

6. $0 < p < 1$, $a = (1 - p)/p^2$

8.  0.9919 

13. Average value = 5.0310, true value shown in Chapter 6 to be $\lambda = 5$ ($N_{\rm real} = 1000$)

14. $p_X[5] = 0.0029$, $p_X[5] = 0.0031$ (from Poisson approximation)
18. $P[X = 3] = 0.0613$, $\hat{P}[X = 3] = 0.0607$ ($N_{\rm real} = 10{,}000$)
20.  py[ k]  =  exp(— A)Afe/2/fc!  for  k  —  0, 2,4, . . . 

26. $p_X[k] = 1/5$ for $k = 1, 2, 3, 4, 5$
28. 0.4375
31. $8.68 \times 10^{-7}$


Chapter  6 

2.  9/2 
4.  2/3 

8.  geometric  PMF 

12.  (2/p2)  -  1/p 

13.  yes,  if  X  =  constant 

14. predictor = $E[X] = 21/8$, $\mathrm{mse}_{\min} = 47/64 = 0.7343$

15. estimated $\mathrm{mse}_{\min} = 0.7371$ ($N_{\rm real} = 10{,}000$)

20. $\lambda^2 + \lambda$

26. $\sum_{k=0}^{n} (-1)^{n-k} \binom{n}{k} E^{n-k}[X]\, E[X^k]$

27. $\phi_Y(\omega) = \exp(j\omega b)\,\phi_X(a\omega)$

28. $(1 + 2\cos(\omega) + 2\cos(2\omega))/5$

32. true mean = 1/2, true variance = 3/4; estimated mean = 0.5000, estimated variance = 0.7500 ($N_{\rm real} = 1000$)


Chapter  7 


3.  S  =  {(p,n),  (p,d),  (n,p),  (n,d),  (d,p),  (d,n)} 

=  {(1, 5),  (1, 10),  (5, 1),  (5, 10),  (10, 1),  (10, 5)} 


Px,v[i,j\  = 


(i,j)  =  (0,0) 

(t,j)  =  (l,-l) 

(*,/)  =  (!,!) 
(i,j)  =  (2,0) 


10. 1/5
13. $p_X[i] = (1 - p)^{i-1}p$ for $i = 1, 2, \ldots$ and same for $p_Y[j]$


16. $p_{X,Y}[0,0] = 1/4$, $p_{X,Y}[0,1] = 0$, $p_{X,Y}[1,0] = 1/8$, $p_{X,Y}[1,1]$


19.  no 

23.  yes,  X  ~  bin(10, 1/2),  Y  ~  bin(ll,  1/2) 

27. $p_Z[0] = 1/4$, $p_Z[1] = 1/2$, $p_Z[2] = 1/4$, variance always increases when uncorrelated random variables are added


33.  1/8 

37.  0 

38.  3/22 

40. minimum MSE prediction = $E_Y[Y] = 5/8$ and minimum MSE = $\mathrm{var}(Y) = 15/64$ for no knowledge

minimum MSE prediction = $\hat{Y} = -(1/15)x + 2/3$ and minimum MSE = $\mathrm{var}(Y)(1 - \rho_{X,Y}^2) = 7/30$ based on observing outcome of $X$

41.  W  =  5.4109/j  -  205.0344 

43. $\rho_{W,Z} = \sqrt{\eta/(\eta + 1)}$, where $\eta = E_X[X^2]/E_N[N^2]$

46.  see  solution  for  Problem  7.27 

48. $p_{X,Y}[0,0] = 0.1190$, $p_{X,Y}[0,1] = 0.1310$, $p_{X,Y}[1,0] = 0.2410$, $p_{X,Y}[1,1] = 0.5090$ ($N_{\rm real} = 1000$)

49. $\rho_{X,Y} = \sqrt{5}/15 = 0.1490$, $\hat{\rho}_{X,Y} = 0.1497$ ($N_{\rm real} = 100{,}000$)

Chapter  8 

2. $p_{Y|X}[j|0] = 1$ for $j = 0$

$p_{Y|X}[j|1] = 1/6$ for $j = 1, 2, 3, 4, 5, 6$
$P[Y = 1] = 1/12$

5. no, no, no

6. $p_{Y|X}[j|0] = 1/3$ for $j = 0$ and $= 2/3$ for $j = 1$

$p_{Y|X}[j|1] = 2/3$ for $j = 0$ and $= 1/3$ for $j = 1$
$p_{X|Y}[i|0] = 1/3$ for $i = 0$ and $= 2/3$ for $i = 1$
$p_{X|Y}[i|1] = 2/3$ for $i = 0$ and $= 1/3$ for $i = 1$


8. $p_{Y|X}[j|i] = 1/5$ for $j = 0, 1, 2, 3, 4$; $i = 1, 2$
$p_{X|Y}[i|j] = 1/2$ for $i = 1, 2$; $j = 0, 1, 2, 3, 4$
11. 0.4535

13. a. $p_{Y|X}[y_j|0] = 0, 1, 0$ for $y_j = -1/\sqrt{2}, 0, 1/\sqrt{2}$, respectively

$p_{Y|X}[y_j|1/\sqrt{2}] = 1/2, 0, 1/2$ for $y_j = -1/\sqrt{2}, 0, 1/\sqrt{2}$, respectively

$p_{Y|X}[y_j|\sqrt{2}] = 0, 1, 0$ for $y_j = -1/\sqrt{2}, 0, 1/\sqrt{2}$, respectively
not independent (conditional PMF depends on $x_i$)
b. $p_{Y|X}[y_j|0] = 1/2, 1/2$ for $y_j = 0, 1$, respectively
$p_{Y|X}[y_j|1] = 1/2, 1/2$ for $y_j = 0, 1$, respectively
independent

17. $p_Z[k] = p_X[k]\sum_{j=k}^{\infty} p_Y[j] + p_Y[k]\sum_{j=k+1}^{\infty} p_X[j]$

21. $E_{Y|X}[Y|0] = 0$, $E_{Y|X}[Y|1] = 1/2$, $E_{Y|X}[Y|2] = 1$

22. $\mathrm{var}(Y|0) = 0$, $\mathrm{var}(Y|1) = 1/4$, $\mathrm{var}(Y|2) = 2/3$

28. optimal predictor: $\hat{Y} = 0$ for $x = -1$, $\hat{Y} = 1/2$ for $x = 0$, and $\hat{Y} = 0$ for $x = 1$

optimal linear predictor: $\hat{Y} = 1/4$ for $x = -1, 0, 1$
30. $\hat{E}_{Y|X}[Y|0] = 0.5204$, $\hat{E}_{Y|X}[Y|1] = 0.6677$ ($N_{\rm real} = 10{,}000$)

Chapter  9 

1.  0.0567 

4.  yes 

6.  (Xi,  X2)  independent  of  X3 

10. $E[\bar{X}] = E_X[X]$, $\mathrm{var}(\bar{X}) = \mathrm{var}(X)/N$

13.  Cx  =  2  4  ’  det(Cx)  =  0,  no 

17.  a.  no,  b.  no  ,  c.  yes,  d.  no 

2°.  Cx  =  [  3  l 


26. $\mathbf{A} = \begin{bmatrix} 0.9056 & 0.4242 \\ -0.4242 & 0.9056 \end{bmatrix}$ for MATLAB 5.2

$\mathbf{A} = \begin{bmatrix} -0.9056 & 0.4242 \\ 0.4242 & 0.9056 \end{bmatrix}$ for MATLAB 6.5, R13

$\mathrm{var}(Y_1) = 7.1898$, $\mathrm{var}(Y_2) = 22.8102$


35.  B  = 


36. $\hat{\mathbf{C}}_X = \begin{bmatrix} 4.0693 & 0.9996 \\ 0.9996 & 3.9300 \end{bmatrix}$ ($N_{\rm real} = 1000$)


Chapter  10 


2.  1/80 

4.  a.  no  b.  yes  c.  no 

6.  ol\  >  0,  a.2  >  0,  and  —  1 


12.  0.0252 

14.  Gaussian:  0.0013  Laplacian:  0.0072 

17.  first  person  probability  =  0.393,  first  two  persons  probability  =  0.090 
19. $F_X(x) = 1/2 + (1/\pi)\arctan(x)$

22. $F_X(x) = \Phi\left(\frac{x - \mu}{\sigma}\right)$

28.  2.28% 

30.  eastern  U.S. 

33.  yes 
36. $c \approx 14$


40. $p_Y(y) = \begin{cases} \dfrac{\lambda}{4(y - 1)^{3/4}} \exp[-\lambda(y - 1)^{1/4}] & y > 1 \\ 0 & y \le 1 \end{cases}$

43. $p_Y(y) = p_X(y) + p_X(-y)$


pr{y)  = 


~wv  0  < y  < 1 

0  otherwise 


P[- 

-2  < 

X 

VI 

2] 

P[- 

1 

h-* 

IA 

X 

VI 

1] 

P[- 

V 

t-H 

l 

X 

VI 

1] 

P[- 

-1  < 

X 

< 

1] 

P[- 

VI 

t-H 

1 

X 

< 

1] 

=  1  —  \  exp(— 2) 
=  1  —  5  exp(— 1) 

=  f-5exP(-l) 
=  5~5exP(-l) 

=  f  —  5  exp(— 1) 


51.
54. $g(U) = \sqrt{2\ln(1/(1 - U))}$

Chapter  11 

1. 7/6
10. ±9.12

11. 0.1353
14. $N$
19. 0.0078

21. $\sqrt{E[U]} = \sqrt{1/2}$, $E[\sqrt{U}] = 2/3$

22. $E[s(t_0)] = 0$, $E[s^2(t_0)] = 1/2$

26. $\sigma^2/2$

27. $\sigma^2/2$

30. $T_{\min} = 5.04$, $T_{\max} = 8.96$

35. $E[X^3] = 3\mu\sigma^2 + \mu^3$, $E[(X - \mu)^3] = 0$

38. $E[X^n] = 0$ for $n$ odd, $E[X^n] = n!$ for $n$ even

42. $\delta(x - \mu)$

44.  y/2var(X) 

46. $E[X] = 1.2533$, $\hat{E}[X] = 1.2538$; $\mathrm{var}(X) = 0.4292$, $\widehat{\mathrm{var}}(X) = 0.4269$ ($N_{\rm real} = 1000$)

Chapter  12 
1.  7/16 

3.  no,  probability  is  1/4 

5. $\pi = 4P[X^2 + Y^2 < 1]$, $\hat{\pi} = 3.1140$ ($N_{\rm real} = 10{,}000$)

7. 1/4

10. $P = 0.19$, $\hat{P} = 0.1872$ ($N_{\rm real} = 10{,}000$)


11.  0 
15. $p_X(x) = 2x$ for $0 < x < 1$ and zero otherwise, $p_Y(y) = 2(1 - y)$ for $0 < y < 1$ and zero otherwise


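18. $F_{X,Y}(x, y) = \begin{cases} 0 & x < 0 \text{ or } y < 0 \\ \frac{1}{8}xy & 0 \le x < 2,\ 0 \le y < 4 \\ \frac{1}{4}y & x \ge 2,\ 0 \le y < 4 \\ \frac{1}{2}x & 0 \le x < 2,\ y \ge 4 \\ 1 & x \ge 2,\ y \ge 4 \end{cases}$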

23.  (1  -  exp(— 2))2 
25.  no 


26.  Q( 2) 

30. $P[\text{bullseye}] = 1 - \exp(-2) = 0.8646$, $\hat{P}[\text{bullseye}] = 0.8730$ ($N_{\rm real} = 1000$)
36. $W \sim \mathcal{N}(\mu_W, \sigma_W^2)$, $Z \sim \mathcal{N}(\mu_Z, \sigma_Z^2)$


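38. $[W\ Z]^T \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$, where $\boldsymbol{\mu} = [3\ 8]^T$ and $\mathbf{C} = \begin{bmatrix} 2 & 5 \\ 5 & 14 \end{bmatrix}$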

43.  \/5t f 


45.  uncorrelated  but  not  necessarily  independent 


47. 


52. $Q(1)$


Chapter  13 

2.  yes,  c  =  1/x 

4. $p_{Y|X}(y|x) = \exp(-y)/(1 - \exp(-x))$ for $0 < y < x$, $x > 0$
8. $p_{X,Y}(x, y) = 1/x$ for $0 < y < x$, $0 < x < 1$; $p_Y(y) = -\ln y$ for $0 < y < 1$

10. $p_{Y|X}(y|x) = 1/x$ for $0 < y < x$, $0 < x < 1$; $p_{X|Y}(x|y) = 1/(1 - y)$ for $y < x < 1$, $0 < y < 1$

14. Use $P = \int_{-1/2}^{1/2} P[|X_2| - |X_1| < 0 \,|\, X_1 = x_1]\, p_{X_1}(x_1)\, dx_1$ and note independence of $X_1$ and $X_2$ so that $P = \int_{-1/2}^{1/2} P[|X_2| < |x_1|]\, dx_1$

16.  Q(— 1),  assume  R  and  E  are  independent 

21.  1/2 

24. Use $E_{(X+Y)|X}[X + Y|x] = E_{Y|X}[Y|x] + x$ to yield $E_{(X+Y)|X}[X + Y|X = 50] = 77.45$ and $E_{(X+Y)|X}[X + Y|X = 75] = 84.57$

Chapter  14 

1.  Ey[Y]  =  6,  var(Y)  =  11/2 
6.  1/16 

9. $Y \sim \mathcal{N}(0, \sigma_1^2 + \sigma_2^2 + \sigma_3^2)$

12. no since $\mathrm{var}(\bar{X}) \to \sigma^2/2$ as $N \to \infty$
19. $E_Y[Y] = 0$, $\mathrm{var}(Y) = 1$

21. $\hat{X}_3 = 7/5$

24.  msemin  =  8/15  =  0.5333 

25.  msemin  =  0.5407  (iVreai  =  5000) 

Chapter  15 

4.  limjv^-oo  Eili  =  0,  ai  =  1  /IV3/4 

7.  no  since  the  variance  does  not  converge  to  zero 

13. $Y \sim \mathcal{N}(2000, 1000/3)$

19. $N = 5529$

20. $1 - Q(-77.78) \approx 0$

22.  Gaussian,  “converges”  for  all  N  >  1 

23.  no  since  approximate  95%  confidence  interval  is  [0.723, 0.777] 




26.  drug  group  has  approximate  95%  confidence  interval  of  [0.69, 0.91]  and  placebo 
group  has  [0.47, 0.73].  Can’t  say  if  drug  is  effective  since  true  value  of  p  could 
be  0.7  for  either  group. 

Chapter  16 

1.  a.  temperature  at  noon  b.  expense  in  dollars  and  cents  c.  time  left  in  hours  and 
minutes 

4.  p50(l  —  p)50,  0 

7. $E[X[n]] = (n + 1)/2$, $\mathrm{var}(X[n]) = (3/4)(n + 1)$

9. $\exp(-3)$

13. independent but not stationary

16. $\mu_X[n] = 0$, $c_X[n_1, n_2] = \delta[n_2 - n_1]$, exactly the same as for WGN with $\sigma_U^2 = 1$

18. $P[X[n] > 3] = 0.000011$, $P[U[n] > 3] = 0.0013$
22. $E[X(t)] = 0$, $c_X(t_1, t_2) = \cos(2\pi t_1)\cos(2\pi t_2)$

24. $E[Y[n]] = 0$, $\mathrm{cov}(Y[0], Y[1]) = -1$, not IID since samples are not independent

26. $E[X[n]] = 0$, $c_X[n_1, n_2] = \sigma_U^2 \min(n_1, n_2)$

27. $c_X[1,1] = 1/2$, $c_X[1,2] = 1/4$, $c_X[1,3] = 0$, $\hat{c}_X[1,1] = 0.5057$, $\hat{c}_X[1,2] = 0.2595$, $\hat{c}_X[1,3] = -0.0016$ ($N_{\rm real} = 10{,}000$)

31.  px\p\  —  0,  cx[ni,n2]  =  a\a^jS[ri2  —  ni],  white  noise 
34.  N  =  209 

Chapter  17 

1. yes, $\mu_X[n] = \mu = 2p - 1$, $r_X[k] = 1$ for $k = 0$ and $r_X[k] = \mu^2$ for $k \ne 0$

5. WSS but not stationary since $p_{X[0]} \ne p_{X[1]}$

9. $a > 0$, $|b| < 1$

12. b, d, e

17. $E[X[n]] = 0$, $\mathrm{var}(X[n]) = \sigma_U^2(1 - a^{2(n+1)})/(1 - a^2)$; as $n \to \infty$, $\mathrm{var}(X[n]) \to \sigma_U^2/(1 - a^2)$

19. Principal minors are 1, 15/64, and −17/32 for 1 × 1, 2 × 2, and 3 × 3, respectively. Fourier transform is $1 - (7/4)\cos(2\pi f)$, which can be negative.
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20.  nx[n]  =  rx[k]  =  (1/2 )ru[k]  +  (1  /4)rv[k  -  1]  +  (1/4 )ru[k  +  1] 

28.  Pxtf)  =  2^(1  -  cos(2t r/)) 

30.  Px(f)  =  a\°u 

34.  rx[k]  =  a^5[k ]  +  p2,  Px{f )  =  +  n28{f) 

38. $r_X[k] = 9/4, 3/2, 1/2$ for $k = 0, 1, 2$, respectively, and zero otherwise
40. $a > |b|$

42. $E[X[n]] = 0$, $\hat{E}[X[10]] = -0.0105$, $\hat{E}[X[12]] = 0.0177$; $r_X[2] = 0.1545$, $\hat{E}[X[10]X[12]] = 0.1501$, $\hat{E}[X[12]X[14]] = 0.1533$ ($N_{\rm real} = 1000$)

44. $P_X(f) = 1$; increasing $N$ does not improve estimate - must average over ensemble of periodograms

47. $2(\exp(-10) - \exp(-100))$

50. $\mu_X(t) = 0$, $\mathrm{var}(X(t)) = N_0/(2T)$, no

51. $\mathrm{var}(\hat{\mu}_N) = (1/N)\sum_{k=-(N-1)}^{N-1} (1 - |k|/N)\, N_0 W \sin(\pi k/2)/(\pi k/2)$ for $N = 20$.

It  is  0.9841  times  that  of  the  variance  of  the  sample  mean  for  Nyquist  rate 
sampled  data. 

Chapter  18 

1. $r_X[k] = 3$ for $k = 0$, $r_X[k] = -1$ for $k = \pm 2$, and equals zero otherwise; $P_X(f) = 3 - 2\cos(4\pi f)$

4. $b_1 = 0$, $b_2 = -1$

7. $r_X[k] = 3$ for $k = 0$, $r_X[k] = -2$ for $k = \pm 1$, $r_X[k] = 1/2$ for $k = \pm 2$, and equals zero otherwise; $P_X(f) = 3 - 4\cos(2\pi f) + \cos(4\pi f)$

13. $H_{\rm opt} = (2 - 2\cos(2\pi f))/(3 - 2\cos(2\pi f))$; $\mathrm{mse}_{\min} = 0.5552$
18. $\mathrm{mse}_{\min} = r_X[0](1 - \rho^2_{X[n_0], X[n_0+1]})$

22. $\hat{X}[n_0 + 1] = -[b(1 + b^2)/(1 + b^2 + b^4)]X[n_0] - [b^2/(1 + b^2 + b^4)]X[n_0 - 1]$; $\mathrm{mse}_{\min} = 1 + b^6/(1 + b^2 + b^4)$

24. $\mathrm{mse}_{\min} = 1 + [b^6/(1 + b^2 + b^4)] = 85/84 = 1.0119$, $\widehat{\mathrm{mse}}_{\min} = 1.0117$ ($N_{\rm real} = 10{,}000$)

27. $\hat{X}[n_0] = [a/(1 + a^2)](X[n_0 + 1] + X[n_0 - 1])$

29. $P_X(F) = (N_0 T^2/2)[\sin(\pi F T)/(\pi F T)]^2$
32. $r_X(0) = N_0/(4RC)$, no

Chapter  19 

1.  E[X[n]Y[n  +  Arj]  =  (-1  )n+ka^6[k},  no 

5. $r_{X,Y}[k] = 0$

6. $r_{X,Y}[k] = 0$ for $k > 0$ and $r_{X,Y}[k] = (1/2)^{-k}$ for $k \le 0$
10. $|P_{X,Y}(f)| = \sqrt{5 + 4\cos(2\pi f)}$, $\angle P_{X,Y}(f) = \arctan$

12. $r_Z[k] = r_X[k] - r_{X,Y}[k] - r_{Y,X}[k] + r_Y[k]$, $P_Z(f) = P_X(f) - P_{X,Y}(f) - P_{Y,X}(f) + P_Y(f)$

15. $\gamma_{X,Y}(f) = -1$; perfectly predictable using $\hat{Y}[n_0] = -X[n_0]$

18. $H_{\rm opt}(f) = P_{X,Y}(f)/P_X(f)$

23. $r_{X,Y}(\tau) = N_0/(2T)$ for $0 \le \tau \le T$ and zero otherwise
26. $r_{X,Y}[k] = \delta[k] - b\delta[k - 1]$, for $b = -1$


  k    $r_{X,Y}[k]$   $\hat{r}_{X,Y}[k]$      k    $r_{X,Y}[k]$   $\hat{r}_{X,Y}[k]$
 -5        0           -0.0077               0        1            0.9034
 -4        0           -0.0242               1        1            0.9031
 -3        0            0.0259               2        0           -0.0064
 -2        0            0.0004               3        0           -0.0007
 -1        0           -0.0062               4        0            0.0267
                                             5        0           -0.0238

Chapter  20 

2.  1/4 

5. $\mathbf{Y} = [Y[0]\ Y[1]]^T \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_Y)$, where


2 


not  independent 

10. WSS with $\mu_Z = \mu_X \mu_Y$, $P_Z(f) = P_X(f) * P_Y(f)$


14.  1 
17. $T = 66{,}347$


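19. $2Q(1/\sqrt{T})$

22. $\mathbf{Y} = [Y(0)\ Y(1/4)]^T \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_Y)$, where $\mathbf{C}_Y = \begin{bmatrix} 2 & \sqrt{2} \\ \sqrt{2} & 2 \end{bmatrix}$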


25.  Pu(F )  =  Py(F)  =  8(1  —  |F|/10)  for  |F|  <  10  and  zero  otherwise 

30. $X[n] = U[n] - U[n - 1]$, where $U[n]$ is WGN with $\sigma_U^2 = 1$

31. $r_X[0] = 2$, $\hat{r}_X[0] = 1.9591$; $r_X[1] = -1$, $\hat{r}_X[1] = -0.9614$; $r_X[2] = 0$, $\hat{r}_X[2] = -0.0195$; $r_X[3] = 0$, $\hat{r}_X[3] = -0.0154$

Chapter  21 

3.  probability  =  0.1462,  average  =  5 

7. $\lambda = 2$, $\hat{\lambda} = 1.9629$; $\lambda = 5$, $\hat{\lambda} = 4.9072$ (based on 10,000 arrivals with $\hat{\lambda} = N(t)/t$)

10. $E[N(t_2) - N(t_1)] = \mathrm{var}(N(t_2) - N(t_1)) = \lambda(t_2 - t_1)$

13.  0.6321 


17.  10  minutes 

20. $P[T_2 < 1] = 1 - 2\exp(-1) = 0.2642$, $\hat{P}[T_2 < 1] = 0.2622$ ($N_{\rm real} = 10{,}000$)

23. $\lambda t_0(2p - 1)$

Chapter  22 
2.  1/128 

5.  P[Y[ 2]  =  1\Y[1]  =  1,  Y[0]  =  0]  =  1  -p,  P[Y[2]  =  1\Y[1]  =  1]  =  1/2  for  all  p 
9.  Pe  =  0.3362 

11.  P[red  drawn]  =  1/3 

12.  1/2 

14.  yes,  7tt  =  [5  §  0  0] 

19. $\boldsymbol{\pi}^T = [0.2165\ 0.4021\ 0.3814]$

24. $P[\text{rain}] = 0.6964$
26. $n = 6$

28. $\pi_1 = 1/3$, $\hat{\pi}_1 = 0.3240$ (based on playing 1000 holes)


Index 


Abbreviations,  782 

ACF  (see  Autocorrelation  function) 

ACS  (see  Autocorrelation  sequence) 

Affine  transformation,  313,  400 
AR  (see  Autoregressive) 

ARMA (see Autoregressive moving average)
Arrival  angle  measurement,  671 
Arrival  rate  of  Poisson  process,  713 
Arrival  times  of  Poisson  process,  721,  734 
Autocorrelation  function: 
definition,  581 
properties,  581,  624 
Autocorrelation  matrix,  561 
Autocorrelation  method  of  LPC,  628 
Autocorrelation  sequence: 
definition,  552 
properties,  553 
estimator,  577,  627 
LSI  system  output,  602,  604 
MATLAB  code  for  estimation,  578 
for  deterministic  signal,  798 
Autoregressive: 

definition,  558,  633,  681 
autocorrelation  sequence,  560 
power  spectral  density,  572 
linear  prediction,  618 
generation  of  realization,  592,  595 
for  modeling  of  PSD,  588,  628 
cross-correlation of filter input/output, 668
Autoregressive  moving  average,  681 
Auxiliary  random  variable,  183,  251,  403 
Averaged  periodogram: 
definition,  579 
MATLAB  code  for,  580 
Averages: 

ensemble,  563 
temporal,  563 

Average  power,  553,  575,  582 


Average  value  (see  Expected  value) 

Bandpass  random  process: 
definition,  691 
representation,  692 
“white”  Gaussian  noise,  694 
Bayes’  theorem,  86 

Bayes’  theorem  for  conditional  PDF,  223 
Bernoulli  law  of  large  numbers,  490 
Bernoulli  trial: 
independent,  90 
dependent,  94 
Markov  chain,  739 

Bernoulli  probability  law  (mass  function),  111 
Bernoulli  random  process,  517 
Bernoulli  sequence,  90 
Binomial  coefficient  (see  Combinations) 
Binomial  counting  random  process,  526 
Binomial  probability  law  (mass  function): 
definition,  4,  63,  91,  112 
maximum  value,  129 
approximation  by  Poisson,  113,  152,  712 
approximation  by  Gaussian,  501 
Binomial  theorem,  11,  68,  71 
Bins,  21 

Birthday  problem,  57 
Bivariate  Gaussian  PDF: 
definition,  401,  408 
standardized  version,  385 
Bivariate  PMF  (see  Joint  PMF) 

Body  mass  index,  202 
Boole’s  inequality,  52 
Box-Mueller  transform,  431 
Brownian  motion  (see  Wiener 
random  process) 

Cartesian  product  of  sets,  89 

Cauchy  PDF,  299,  402 

Cauchy-Schwarz  inequality,  197,  213,  645 
Causal signal, 796

Causal  system,  605,  801,  807 

CCF  (see  Cross-correlation  function) 

CCS  (see  Cross-correlation  sequence) 

CDF  (see  Cumulative  distribution  function) 
Central  limit  theorem: 
description,  495,  497,  501 
proof,  513 

using  to  compute  binomial 
probabilities,  501 
using  to  compute  chi-squared 
probabilities,  497 

used  as  justification  for  modeling,  674 
Central  moments,  146,  190 
Certain  set,  44 

Chain  rule  (probability),  85,  266 
Chain  rule  for  Markov  chains,  745 
Channel  delay  measurement,  658 
Chapman-Kolmogorov  equations,  748 
Characteristic  function: 
scalar  definition,  147,  359 
joint  definition,  198,  265,  414,  467 
properties,  151 
finding  PMF  from,  185 
finding  PDF  from,  361 
finding  moments  from,  149,  199,  266,  468 
convergence  of  sequence,  152,  361 
table  of,  145,  355 
for  multivariate  Gaussian,  481 
Chebyshev  inequality,  362 
Chi-squared  PDF,  302,  429,  480,  498,  509 
Cholesky  decomposition,  417,  475 
Clutter,  674 
Coherence  function,  649 
Combinations,  60 
Combinatorics,  54 
Communications : 

phase  shift  keying,  24 
channel  modeling,  104 
model  of  link,  81 
error  probability,  308 
Complement  set,  39 
Compound  Poisson  random  process: 
definition,  723 
characteristic  function,  724 
mean  function,  726 
variance  function,  735 
Computer  data  generation: 


discrete  random  variables,  122 
discrete  random  vectors,  200 
conditioning  approach,  235,  447 
given  covariance  matrix,  283 
AR  random  process,  592,  595,  633 
Cauchy,  368 
exponential,  366 
Gaussian,  18,  27,  33,  431 
bivariate  Gaussian,  416,  448 
multivariate  Gaussian,  475 
Gaussian  from  sum  of  uniforms,  481 
Uniform,  26,  33 
Laplacian,  326 
WSS  with  given  PSD,  696 
continuous-time  Wiener 
random  process,  704 
Poisson  random  process,  727 
Markov  chain,  764 
Conditional  expectation: 
definition,  229,  446 
properties,  233,  447 
Conditional  mean 

(see  Conditional  expectation) 
Conditional  probability,  75 
Conditional  PDF: 
definition,  437 
properties,  440 
of  bivariate  Gaussian,  439 
Conditional  PMF: 
definition,  218,  266 
properties,  222 
and  independence,  224 
and  Markov  property,  745 
Confidence  interval,  504 
Continuity  theorem  for  characteristic 
functions,  152,  361 
Contours  of  constant  PDF,  382,  386 
Convergence  in  probability,  491 
Convergence  of  random  variables,  507 
Convolution: 

integral,  396,  493,  624,  685,  806 
sum,  200,  599,  798 
to  compute  PMF,  184 
to  compute  PDF,  493 
MATLAB  program  to  compute,  511 
Correlation: 
definition,  196 
and  causality,  197 
Correlation coefficient:
definition,  196 
for  prediction,  196 
estimation  of,  210 

for  standard  bivariate  Gaussian,  407 
for  random  processes,  555,  649 
Correlation  of  signals,  798 
Correlation  time,  695 
Countable  set,  43 
Countably  infinite  set,  43 
Counting  methods,  37,  54 
Covariance: 
definition,  188 
properties,  208 

of  independent  random  variables,  191 
estimation  of,  211,  545 
for  bivariate  Gaussian  PDF,  406 
Covariance  function,  533 
Covariance  matrix: 
definition,  258,  281 
properties,  258,  561 
for  uncorrelated  random 
variables,  259,  465 
diagonalization,  260 
eigenanalysis,  261,  431 
estimation  of,  270 
for  Gaussian  PDF,  408,  459 
Covariance  sequence,  533 
CPSD  (see  Cross-power  spectral  density) 
Cross-correlation  function: 
definition,  657 
properties,  657 
Cross-correlation  sequence: 
definition,  643 
properties,  645 
estimation  of,  662 
MATLAB  program  for 
estimation,  662 
for  deterministic  signals,  798 
Cross-power  spectral  density: 
discrete-time  definition,  647,  648 
continuous-time definition, 657
properties, 650
at input/output of filter, 654
Cross-spectral matrix, 669
CTCV (see Continuous-time/continuous-valued under random process)
CTDV (see Continuous-time/discrete-valued under random process)
Cumulative  distribution  function: 
discrete  random  variable,  118,  250 
continuous  random  variable,  303,  389,  463 
conditional,  441,  449 
mixed  random  variable,  319 
D/A  (see  Digital-to-analog) 

Data  generation  (see  Computer 
data  generation) 

Data  compression  via  Markov  chain,  767 
Data  compression  via  source  encoding,  155 
dB (see Decibel)

DC  (see  Direct  current) 

Decibel,  629 
Decorrelation  of 

random  variables,  208,  260,  410 
Demodulation,  705 
DeMoivre-Laplace  theorem,  501 
De  Morgan’s  laws,  42 
Dependent  subexperiments,  94 
DFT  (see  Discrete  Fourier  transform) 
Differences  552,  681 
Digital-to-analog  convertor,  629 
Dirac  delta  function,  319,  336,  787,  804 
Direct  current  (DC  level)  signal,  477,  566 
Discrete  Fourier  transform: 
definition,  799 
use  in  spectral  analysis,  580 
to  approximate  inverse 

Fourier  transform,  616,  799 
Disjoint  sets,  41,  44,  67 
Dow-Jones  industrial  average,  517 
DTCV (see Discrete-time/continuous-valued under random process)
DTDV (see Discrete-time/discrete-valued under random process)

Dyad  (see  Outer  product) 

Eigenanalysis  (eigendecomposition) : 
definition,  261,  793 
used  to  find  powers  of  matrix,  750 
Eigenvalue/eigenvector,  793 
Empty  set,  38 

Ensemble  average  for  Markov  chain,  563 
Entropy,  157 

(see  also  Real-world  example- 
data  compression) 

Envelope  of  random  process,  693,  705 
Ergodic:

definition,  564 
mean,  564 
autocorrelation,  577 
Markov  chain,  752,  761 
Erlang  PDF,  302,  721 
Estimation: 

autocorrelation  sequence,  577 
autoregressive  parameters,  623 
CDF,  123 

correlation  coefficient,  210 
covariance  matrix,  270 
cross-correlation  sequence,  662 
interval,  20 
least  squares,  538,  546 
mean,  21,  154,  364 
moments,  154,  202 
PDF,  14,  19 
PMF,  122,  202 

power  spectral  density,  568,  579 
probability  of  heads,  7,  19 
probability  of  error,  27 
variance,  154,  364 
Even  function,  334,  368,  787,  805 
Even  sequence,  554,  796 
Events: 

definition,  44 
impossible,  44 
joint,  76,  458 
simple,  44 

zero  probability,  48,  53,  384 
Expected  value: 
scalar,  136,  345 
vector,  255,  459 
matrix,  260,  281 
center  of  mass  analogy,  158,  346 
estimation  of,  154,  364 
nonexistence  of,  139,  159,  348 
properties,  159,  160 
of  sum,  187,  405,  465 
of  product,  187,  405 
for  function  of,  140,  187,  351,  405 
table  of,  145,  355 
of  conditional  PMF,  229 
of  conditional  PDF,  446 
and  conditional  expectation,  230 
using  conditioning,  234,  447 
bounds  on,  368 


from  CDF,  370 
Experiments: 
description,  1 
subexperiments,  89 
independence  of  subexperiments,  89 
dependence  of  subexperiments,  94 
Exponential  function: 
as  limit,  784 
as  infinite  series,  785 
Exponential  PDF,  296,  302 
Extrapolation  (see  Prediction) 

Factorial: 

definition,  4,  57 

computing,  72 

and  Gamma  function,  301 

Fast  Fourier  transform,  33,  579,  615 


FFT  (see  Fast  Fourier  transform) 
Filtering: 
all-pole,  627 
bandpass,  694 
digital  filter  design,  696 
interference  rejection,  624 
of  jointly  WSS  random  processes,  653 
lowpass,  25,  659,  804,  808 
passband,  656,  804,  808 
stopband,  804,  808 
Wiener,  613 

Finite  dimensional  distribution,  523 
Finite  impulse  response  filter,  600,  803 
Finite  set,  43 

FIR  (see  Finite  impulse  response  filter) 
Fish  population  measurement,  698 
Forecasting  (see  Prediction) 

Fourier  transform: 

discrete-time  Fourier  transform: 
definition,  796,  798 
table  of,  797 

continuous-time  Fourier  transform: 
definition,  805 
table  of,  806 

as  narrowband  filter,  670 
Frequency  response,  600,  802,  808 
Functions: 

even,  334,  368,  787 
hermitian,  650 
monotone,  337 
odd,  368,  787 
Fundamental theorem of calculus, 310, 786

Gamma  function,  300 
Gamma  PDF,  300 
Gaussian  PDF,  5,  20,  296,  459 
Gaussian  mixture  PDF,  332,  371 
Gaussian  random  variable,  296 
Gaussian  random  process: 
definition,  677 
linear  filtering  of,  682 
(see  also  Bandpass  random  process) 
Geometric  probability  law 
(mass  function),  91,  113,  720 
Geometric  series,  47,  785 

Histogram,  14 

Homogeneous  transition  probabilities,  745 
Huffman  code,  156 
Hypergeometric  probability  law 
(mass  function),  63 

IID (see Independent and
identically distributed)

IIR (see Infinite impulse response filter)
Image  coding,  272 
Image  signal  processing,  419 
Importance  sampling,  364 
Impossible  event,  44 
Impulse  (discrete-time),  795 
Impulses  (for  continuous-time 
see  Dirac  delta  function) 

Impulse  response,  599,  801,  807 
Increment  of  random  process: 
definition,  526 
independent,  526 
stationary,  526 

for  Poisson  random  process,  714 
for  Wiener  random  process,  679,  687 
Independent  events,  78,  83 
Independent  increments,  526 
Independent  and  identically 
distributed,  466,  678 
Independent  random  variables: 
definition,  24,  179 
factorization  of  CDF,  392 
factorization  of  PDF,  392,  462 
factorization  of  PMF,  179,  224 
factorization of characteristic
function, 282, 468
transforming  Gaussian  random 
variables,  410 
functions  of,  199 
Independent  subexperiments,  90 
Indicator  random  variable,  353 
Induction,  proof  by,  783 
Infinite  impulse  response  filter,  803 
Infinite  set,  43 
Inner  product,  791,  798 
In-phase  component,  693 
Integration  by  parts,  368,  787 
Integration  using  approximating  sum,  12,  786 
Interarrival  times  of  Poisson  process,  718 
Interference  suppression,  624 
Interpolation,  611 
Intersection  of  sets,  39 
Inverse  probability  integral 
transformation,  324 
Inverse  Q  function,  308 
Irreducible  Markov  chain,  756 

Joint  CDF: 

discrete-time  random  variables,  177 
continuous-time  random  variables,  389 
computing  probability  from,  391 
Joint  moments,  189,  192,  266,  412 
Joint  probability,  76 

Jointly  distributed  random  variables,  170,  379 
Jointly  WSS,  642 
Joint  PDF: 

definition,  381,  458 
bivariate  Gaussian,  385 
from  joint  CDF,  391 
Joint  PMF: 
definition,  171 
estimation  of,  202 
Joint  sample  space,  170 

Karhunen-Loeve  transform,  273 

Laplacian  PDF,  298 

Law  of  averages  (see  Law  of  large  numbers) 

Law  of  large  numbers,  491 

Least  squares,  538,  546 

Leibnitz  rule,  335,  787 

Linear  predictive  coding,  628 

Linear  prediction  equations,  618 
(see also Prediction)

Linear  shift  invariant  system: 
definition,  599,  800 

effect  on  WSS  random  process,  602,  654 
described  by  difference  equation,  605,  802 
Linear  systems  (see  Appendix  D) 

Linear  time  invariant  system: 
definition,  623,  807 
effect  on  WSS  random  process,  624 
Linear  transformation: 

of  Gaussian  random  variable,  314 
joint  PDF,  399 

of  Gaussian  random  vector,  399,  408,  464 
Line  fitting,  538 
Lowpass  filter  (see  Filtering) 

Lowpass  random  process,  693 
LPC  (see  Linear  predictive  coding) 

LSI (see Linear shift invariant system)

LTI  (see  Linear  time  invariant  system) 

MA  (see  Moving  average) 

Magnitude  response  of  linear 
system,  802,  808 
Mappings: 

one-to-one,  108 
many-to-one,  108 
Marginal  CDF,  392 
Marginal  characteristic  function,  265 
Marginal  PDF,  387,  441,  461 
Marginal  PMF,  174,  224,  249 
Marginal  probability,  76 
Markov  chain,  96,  742 
Markov  property,  267,  735 
Markov  random  process,  715,  739 
Markov  sequence,  95 
Markov  state  probability,  742 
MATLAB  overview,  31 
MATLAB  programs: 

simulation  of  heads  for  four  coin  tosses,  7 
simulation  of  three  random  outcomes,  18 
generation  of  Gaussian  noise,  18 
estimated  PDF,  21 
estimated  probability,  22 
estimated  mean,  22 
estimated  mean  of  squared  Gaussian 
random  variable,  23 
estimated  probability  of  error  in  digital 
communication  system,  27 


tutorial  MATLAB  program,  35 
simulation  of  birthday  problem,  59 
generation  of  discrete  random  variable,  122 
general program for discrete
random variable generation, 165
generation  of  multiple  discrete 
random  variables,  201 
simulation of die experiment
(dependent Bernoulli trials), 232
generation  of  discrete  random  vector 
using  conditioning,  236 
decorrelation  of  random  vector,  271 
simulation  of  repeated  coin  tossing 
experiment,  290 

estimation  of  PDF  using  histogram,  328 
calculation  of  Q  function,  341 
calculation  of  inverse  Q  function,  341 
calculation  of  tail  probability,  366 
generation  of  Gaussian  random  vector  and 
estimation  of  mean 
and  covariance,  418 
generation  of  Gaussian  random  vector 
using  conditioning,  448 
generation  of  Gaussian  random  vector 
using  Cholesky  decomposition,  476 
demonstration  of  central  limit  theorem,  511 
simulation  of  nonstationary  random 
processes,  524 

generation  of  MA  random  process,  529 
scatter  diagram  for  randomly 
phased  sinusoid,  537 
line  fitting  of  summer  rainfall,  541 
generation  of  AR  random  process,  559 
estimation  of  ACS  of  AR 
random  process,  578 
averaged  periodogram 
spectral  estimator,  580 
Wiener  smoother,  615 
AR  and  periodogram 
spectral  estimators,  629 
cross-correlation  estimator,  662 
reverberation  spectral  estimation,  706 
reverberation  simulation,  709 
Poisson  random  process  simulation,  727 
three-state  Markov  chain  simulation,  764 
Sierpinski  triangle,  766 
Matrix:
autocorrelation, 561
cofactor, 790
conformable, 789
determinant, 790, 792
diagonal, 791
doubly stochastic, 764, 773
eigenanalysis, 793
inversion:
definition, 790
formula, 792, 794
irreducible, 756
minor, 790
modal, 261
orthogonal, 261, 282, 791
partitioned, 791, 792
positive definite (semidefinite), 259, 280,
790, 793
rotation, 264, 282, 430
square, 790
stochastic, 756
symmetric, 790
Toeplitz, 623
transpose, 790, 792
Mean of random variable (see
Expected value)
Mean function, 533, 581
Mean recurrence time, 761
Mean sequence, 533
Mean square error:
definition, 142, 193, 208, 471
estimator, 194
(see also Prediction)
Median, 333, 450
Memoryless property of Poisson process, 720
Mixed random variables, 317, 354
Mode, 369
Modulation property of Fourier
transform, 797, 806
Moments:
definition, 146, 355, 467
exponential PDF, 357, 359
geometrical PMF, 149
multivariate Gaussian, 468, 684
table of, 145, 355
central vs. noncentral, 146, 358
from characteristic function, 149, 199
existence of, 161
Monotonically increasing function, 337
Monte Carlo method, 13, 365
Monty Hall problem, 81
Moving average:
definition, 528, 681
general form, 544
miscellaneous, 534, 557, 564, 606, 621,
651, 678, 697
MSE (see Mean square error)
Multinomial coefficient, 93, 103
Multinomial probability law
(mass function), 93, 249
Multinomial theorem, 249, 278
Multipath fading (see Rayleigh
fading sinusoid)
Mutually exclusive events, 44, 84

Narrowband representation (see Bandpass
random process)
n-step transition probability matrix, 748
Nonstationarity, 526, 689, 711, 746
Normal (see Gaussian)
Notational conventions, 777
Null set (see Empty set)
Nyquist rate sampling, 584

Odd function, 368, 787, 805
Odd sequence, 796
Odds ratio, 87, 97
Opinion polling, 503
Optical character recognition, 419
Outcome, 38, 44, 518 (see also Realization)
Outer product, 792
Orthogonality, 791
Orthogonality principle:
definition, 474
for linear prediction, 474
geometric interpretation, 474, 482


Parseval’s  theorem,  797,  806 

Partitioning  of  set,  41 

Pascal  triangle,  12 

Pattern  recognition,  421 

PDF  (see  Probability  density  function) 

Periodic  random  process,  591 

Periodogram,  568 

(see  also  Averaged  periodogram) 
Permutations,  57 
Phase  of  random  process,  569 
Phase  response  of  linear  system,  802,  808 




Phase  shift  keying,  24 

PMF  (see  Probability  mass  function) 

Poisson  counting  random  process,  126,  711 

Poisson  probability  law  (mass  function),  113 

Poisson  random  process,  711 

Poles  of  linear  system,  803 

Positive  semidefinite  sequence,  562 

Posterior  PMF,  239 

Posterior  probability,  87 

Power  (average): 

of  random  variable,  553 
from  PSD,  575 
Power  spectral  density: 

continuous-time  definition,  582 
discrete-time  definition,  569,  571 
properties,  573 
AR  random  process,  572 
used  for  physical  modeling,  587,  628 
estimation  of,  568,  579 
MATLAB  code  for  estimation,  580 
at  output  of  LSI  system,  602 
at  output  of  nonlinear  system,  685 
one-sided  version,  576 
physical  interpretation,  608 
of  thermal  noise,  583 
Prediction: 

definition,  142,  471 

linear,  192,  413,  471 

mean  square  error,  142,  208,  276 

nonlinear,  195,  245,  447,  454 

for  bivariate  Gaussian,  413,  447 

of  sinusoid,  544,  634 

of  WSS  random  process,  618,  622 

L-step  for  WSS  random  process,  611 

for  AR  random  process,  618,  634 

for  MA  random  process,  621 

(see  also  Line  fitting) 

Prior  PMF,  238 
Prior  probability,  80,  87 
Probability: 
definition,  44 
axioms,  45 
of  point,  53,  294 
of  union,  49 
monotonicity,  50 
of  interval,  121,  309 
zero  probability  events,  384 
Probability calculations via
conditioning, 444
Probability  density  function: 
definition,  6,  20,  289,  381,  458 
properties,  293,  458 
Cauchy: 

definition,  299 
from  ratio  of  Gaussians,  402 
chi-squared,  302 
Erlang,  302 

exponential,  293,  296,  302 
gamma,  300 
Gaussian: 
scalar,  296 
bivariate,  401 
multivariate,  459 
Gaussian  mixture,  332,  371 
Laplacian,  298 
normal  (see  Gaussian) 

Rayleigh: 

definition,  302 

from  square  root  of  Gaussians,  403,  690 
table  of,  355 
uniform: 

definition,  290,  295 
from  angle  of  Gaussians,  403,  690 
conditional,  437 
estimation  of,  14,  19 
from  CDF,  310 
mass  analogy,  292 

of  mixed  random  variables,  317,  354 
approximating  by  PMF,  288 
Probability  of  error,  26,  80,  308,  446 
Probability  function,  44 
Probability  integral  transformation,  324 
Probability  mass  function: 
definition,  109,  171 
Bernoulli,  111 
Binomial,  112 
Geometric,  113 
Poisson,  113 
table  of,  145 
mass  analogy,  130 
estimation  of,  122,  202 
using  Dirac  delta  function,  324 
Probability  of  point,  53 
Problem  designations,  8 
Projection  theorem,  474 
PSD  (see  Power  spectral  density) 
Pseudorandom noise, 629
PSK  (see  Phase  shift  keying) 

Q  function: 
definition,  306 
approximation  of,  308 
evaluation  of,  341 
Quadratic  form,  280,  790 
Quadrature  component,  693 

Random  number  generator,  7 
Random  process: 
definition,  518 
Bernoulli,  519 
binomial  counting,  526 
Gaussian,  677 
Markov  chain,  744 
Poisson,  711 
sum,  525,  702 
infinite,  520 
semi-infinite,  520 
stationary,  524,  551 
nonstationary,  526 
discrete-time/discrete-valued, 520
discrete-time/continuous-valued, 520
continuous-time/discrete-valued, 520
continuous-time/continuous-valued, 520
Random  variable: 
discrete,  17,  107,  169 
continuous,  287,  377 
mixed,  317 
Random  vectors: 

definition,  23,  170,  248,  458 
Gaussian,  459 
Random  walk,  267,  522 
Rayleigh  fading  sinusoid,  403,  689 
Rayleigh  PDF,  302,  403 
Realization,  18,  518 
Real-world  examples: 

digital  communications,  24 
quality  control,  64 
cluster  recognition,  97 
servicing  customers,  124 
data  compression,  155 
assessing  health  risks,  202 
modeling  human  learning,  237 
image  coding,  272 
setting  clipping  levels,  328 


critical  software  testing,  364 
optical  character  recognition,  419 
retirement  planning,  449 
signal  detection,  476 
opinion  polling,  503 
statistical  data  analysis,  538 
random  vibration  testing,  586 
speech  synthesis,  626 
brain  physiology  research,  663 
estimating  fish  populations,  698 
automobile  traffic  signal  planning,  728 
strange  Markov  chain  dynamics,  765 
Relative  frequency,  1,  7,  19,  759 

(see  also  Law  of  large  numbers) 
Replica  correlator,  477 
Reproducing property, 184, 279, 414
Reverberation,  674 

Sample  mean: 

definition,  21,  154,  279,  466 
in  detection,  478 
law  of  large  numbers,  490 
(see  also  Ergodic) 

Sample  space: 
general,  44,  518 
discrete,  47,  54 
continuous,  52 
reduced,  75 

numerical  vs.  nonnumerical,  106 
multiple  random  variables,  248,  458 
Sampling with (without) replacement, 56
Scatter  diagram,  24,  537 
Sets,  38 

Sierpinski  triangle,  766 
Signals  (see  Appendix  D) 

Signal  detection,  476 
Signal-to-noise  ratio,  369 
Simple  set,  39 
Sinusoidal  signal: 

continuous-time  definition,  805 
discrete-time  definition,  795 
as  random  process: 

random  phase,  370,  530,  536,  591,  596 
random  amplitude,  544 
random  amplitude/phase,  403 
(see  also  Rayleigh  fading) 

PDF  of  random  phase,  530 
recursive  generation  of,  544 
prediction of, 544
miscellaneous,  557 
Size  of  set,  43 
Smoothing,  Wiener: 
definition,  611 
MATLAB  code  for,  615 
SNR  (see  Signal-to-noise  ratio) 

Source  encoding,  155 

Spectral  estimation,  568,  579 

Spectral  factorization,  696 

Spectral  representation  theorem,  569 

Speech  synthesis,  626 

Stable  linear  system,  801,  807 

Stable  sequence,  796 

Standard  bivariate  Gaussian,  385,  394 

Standard  deviation,  356,  371 

Standardized  random  variable,  195 

Standard  normal  random  variable,  296,  305 

Standardized  sum,  495 

State,  742 

State  occupation  time,  759 
State  probabilities,  742 
State  probability  diagram,  95,  742 
State  transition  matrix,  742 
Stationarity: 
definition,  524 
relation  to  IID,  524 
for  WSS  random  process,  680 
Stationary  increments,  526 
Stationary probabilities for Markov chain:
definition,  749,  757 
computing,  750,  762,  771 
Steady-state  probabilities  (see  Stationary 
probabilities  for  Markov  chain) 

Step  function,  805 
Step  sequence,  795 
Stirling’s  formula,  71 
Subexperiments,  89 
Symbols,  777 

System  function,  600,  604,  681,  801 
Sum  of  Poisson  random  processes,  733 
Sum of random number of
random variables, 234, 723
Sum  of  random  processes,  653 
Sum  of  random  variables: 
finding  PMF,  254 
finding  PDF,  397,  470 
binomial  from  Bernoulli,  254 


finding  variance,  257 
of  uniform  random  variables,  395 
of  Gaussian  random  variables,  414 
of  Poisson  random  variables,  279 

Tail  probability,  305 

(see  also  Importance  sampling) 

Taylor  expansion,  130,  785 
Temporal  average,  563 
Temporal  average  for  Markov  chain,  760 
Total  probability: 

for  probability  of  event,  79 
for  PMFs,  229 
for  PDFs,  445 

Transforms  (see  Appendix  D) 

Transform  coding,  273 
Transformations:
of discrete random variable, 115
of continuous random
variable, 22, 313, 316
of multiple random
variables, 181, 251, 400, 464
sum of random variables, 184, 397
general function of, 464
using  CDF  approach,  317 
PDF  of  product,  429 
PDF  of  quotient,  402,  429,  445 
using  conditioning  approach,  225 
nonlinear — Gaussian 
random  processes,  684 
Trellis,  96 
Tuples,  55 

Uncorrelated  random  variables: 
definition,  196 
and  independence,  462 
Uncorrelated  WSS  random  processes,  647 
Uncountable  set,  43,  285 
Union,  39 

Union  bound  (see  Boole’s  inequality) 
Uniform  PDF,  290,  295 
Uniform  PMF,  145 
Unit  step  function,  320 
Unit  step  sequence,  183 
Universal  set,  38 

Variance: 

definition,  143,  355 
properties, 145
table  of,  145 
estimation  of,  154,  364 
sequence,  533 
function,  533 
of  sum,  188,  257,  465 
conditional,  230 
Venn  diagram,  40 
Vibration  analysis,  586 

Waiting  times  (see  Arrival  times 
of  Poisson  process) 

WGN  (see  White  Gaussian  noise) 

White  Gaussian  noise: 

discrete-time  definition,  528 
continuous-time  definition,  686 
obtaining discrete-time from
continuous-time,  583 
bandpass  version,  584,  694 
miscellaneous,  534,  677 
Whitening  (preemphasis),  661 
White  noise,  556,  569,  571 
Wide  sense  stationary: 
definition,  550 
jointly  distributed,  642 
generating  realization  of,  681 
Wiener  filtering: 
definition,  609 
filtering,  609 
smoothing,  611 
prediction,  611 
interpolation,  611 
Wiener-Hopf  equations,  623 
Wiener-Khinchine  theorem,  571 
Wiener  random  process: 
discrete-time,  679 
continuous-time,  687,  703 
computer  generation  of  realization,  704 
Wolfer  sunspot  data,  548 
WSS  (see  Wide  sense  stationary) 

Yule-Walker equations, 623


z-transform, 800


